From ec7d869fe1411b03b63223bf7acd90f74c1360e3 Mon Sep 17 00:00:00 2001 From: Alistair Coles Date: Thu, 24 Jul 2014 15:47:24 +0100 Subject: [PATCH] Updateable Object Sysmeta The goal of this work is to enable object system metadata to be persisted AND updated by 'fast-POST' requests. Unlike user metadata, it should be possible to update individual items of system metadata independently when making a POST request to an object server. Change-Id: I746ba547e0952d03ab9d949ea1a2e13d8b90c16a --- specs/swift/updateable-obj-sysmeta.rst | 270 +++++++++++++++++++++++++ 1 file changed, 270 insertions(+) create mode 100644 specs/swift/updateable-obj-sysmeta.rst diff --git a/specs/swift/updateable-obj-sysmeta.rst b/specs/swift/updateable-obj-sysmeta.rst new file mode 100644 index 0000000..c683772 --- /dev/null +++ b/specs/swift/updateable-obj-sysmeta.rst @@ -0,0 +1,270 @@ +:: + + This work is licensed under a Creative Commons Attribution 3.0 + Unported License. + http://creativecommons.org/licenses/by/3.0/legalcode + +.. + This template should be in ReSTructured text. Please do not delete + any of the sections in this template. If you have nothing to say + for a whole section, just write: "None". For help with syntax, see + http://sphinx-doc.org/rest.html To test out your formatting, see + http://www.tele3.cz/jbar/rest/rest.html + +========================= +Updateable Object Sysmeta +========================= + +The original system metadata patch ( https://review.openstack.org/#/c/51228/ ) +supported only account and container system metadata. + +There are now patches in review that store middleware-generated metadata +with objects, e.g.: + +* on demand migration https://review.openstack.org/#/c/64430/ +* server side encryption https://review.openstack.org/#/c/76578/1 + +Object system metadata should not be stored in the x-object-meta- user +metadata namespace because (a) there is a potential name conflict with +arbitrarily named user metadata and (b) system metadata in the x-object-meta- +namespace will be lost if a user sends a POST request to the object. + +A patch is under review ( https://review.openstack.org/#/c/79991/ ) that will +persist system metadata that is included with an object PUT request, +and ignore system metadata sent with POSTs. + +The goal of this work is to enable object system metadata to be persisted +AND updated. Unlike user metadata, it should be possible to update +individual items of system metadata independently when making a POST request +to an object server. + +This work applies to fast-POST operation, not POST-as-copy operation. + +Problem Description +=================== + +Item-by-item updates to metadata can be achieved by simple changes to the +metadata read-modify-write cycle during a POST to the object server: read +system metadata from existing data or meta file, merge new items, +write to a new meta file. However, concurrent POSTs to a single server or +inconsistent results between multiple servers can lead to multiple meta +files containing divergent sets of system metadata. These must be preserved +and eventually merged to achieve eventual consistency. + +Proposed Change +=============== + +The proposed new behavior is to preserve multiple meta files in the obj_dir +until their system metadata is known to have been read and merged into a +newer meta file. + +When constructing a diskfile object, all existing meta files that are newer +that the data file (usually just one) should be read for potential system +metadata contributions. To enable a per-item most-recent-wins semantic when +merging contributions from multiple meta files, system metadata should be +stored in meta files as `key: (value, timestamp)` pairs. This is not +necessary when system metadata is stored in a data file because the +timestamp of those items is known to be that of the data file. + +When writing the diskfile during a POST, the merged set of system metadata +should be written to the new meta file, after which the older meta files can +be deleted. + +This requires a change to the diskfile cleanup code (`hash_cleanup_listdir`). +After creating a new meta file, instead of deleting all older meta files, +only those that were either older than the data file or read during +construction of the new meta file are deleted. + +In most cases the result will be same, but if a second concurrent request +has written a meta file that was not read by the first request handler then +this meta file will be left in place. + +Similarly, a change is required in the async cleanup process (called by the +replicator daemon). The cleanup process should merge any existing meta files +into the most recent file before deleting older files. To reduce workload, +this merge process could be conditional upon a threshold number of meta +files being found. + +Replication considerations +-------------------------- + +As a result of failures, object servers may have different existing meta +files for an object when a POST is handled and a new (merged) metadata set +is written to a new meta file. Consequently, object servers may end up with +identically timestamped meta files having different system metadata content. + +rsync: + +To differentiate between these meta files it is proposed to include a hash +of the metadata content in the name of the meta file. As a result, +meta files with differing content will be replicated between object servers +and their contents merged to achieve eventual consistency. + +The timestamp part of the meta filename is still required in order to (a) +allow meta files older than a data or tombstone file to be deleted without +being read and (b) to continue to record the modification time of user +metadata. + +ssync - TBD + +Deleting system metadata +------------------------ + +An item of system metadata with key `x-object-sysmeta-x` should be deleted +when a header `x-object-sysmeta-x:""` is included with a POST request. This +can be achieved by persisting the system metadata item in meta files with an +empty value, i.e. `key : ("", timestamp)`, to indicate to any future metadata +merges that the item has been deleted. This guards against inclusion of +obsolete values from older meta files at the expense of storing the empty +value. The empty-valued system metadata may be finally removed during a +subsequent merge when it is observed that some expiry time has passed since +its timestamp (i.e. any older value that the empty value is overriding would +have been replicated by this time, so it is safe to delete the empty value). + +Example +------- + +Consider the following scenario. Initially the object dir on each object +server contains just the original data file:: + + obj_dir: + t1.data: + x-object-sysmeta-p: ('p1', t0) + +Two concurrent POSTs update the object on servers A and B, +with timestamps t2 and t3, but fail on server C. One POST updates +`x-object-sysmeta-p` and adds `x-object-sysmeta-y`. The other POST adds +`x-object-sysmeta-z`. These POSTs result in two meta files being added to the +object directory on A and B:: + + obj_dir: + t1.data: + x-object-sysmeta-p: ('p1', t0) + t2.h2.meta: + x-object-sysmeta-p: ('p2', t2) + x-object-sysmeta-x: ('x1', t2) + x-object-sysmeta-y: ('y1', t2) + t3.h3.meta: + x-object-sysmeta-p: ('p1', t0) + x-object-sysmeta-x: ('x2', t3) + x-object-sysmeta-z: ('z1', t3) + +(`hx` in filename represents hash of metadata) + +A response to a subsequent HEAD request would contain the composition of the +two meta files' system metadata items:: + + x-object-sysmeta-p: 'p2' + x-object-sysmeta-x: 'x2' + x-object-sysmeta-y: 'y1' + x-object-sysmeta-z: 'z1' + +A further POST request received at t4 deletes `x-object-sysmeta-p`. This +causes the two meta files to be read, their contents merged and a new meta +file to be written. This POST succeeds on all servers, +so on servers A and B we have:: + + obj_dir: + t1.data : + x-object-sysmeta-p: ('p1', t0) + t4.h4a.meta: + x-object-sysmeta-p: ('', t4) + x-object-sysmeta-x: ('x3', t3) + x-object-sysmeta-z: ('z1', t3) + x-object-sysmeta-y: ('y1', t2) + +whereas on server C we have:: + + obj_dir: + t1.data : + x-object-sysmeta-p: ('p1', t0) + t4.h4b.meta: + x-object-sysmeta-p: ('', t4) + +Eventually the meta files will be replicated between servers and merged, +leaving all servers with:: + + obj_dir: + t1.data : + x-object-sysmeta-p: ('p1', t0) + t4.h4a.meta: + x-object-sysmeta-p: ('', t4) + x-object-sysmeta-x: ('x3', t3) + x-object-sysmeta-z: ('z1', t3) + x-object-sysmeta-y: ('y1', t2) + +Alternatives +------------ + +One alternative approach would be to preserve all meta files that are newer +than a data or tombstone file and never merge their contents. This removes +the need to include a hash in the meta file name, but has the obvious +disadvantage of accumulating an increasing number of files, each of which +needs to be read when constructing a diskfile. + +Another alternative would store system metadata in separate `sysmeta` file. +It may then be possible to discard the timestamp from the filename (if the +`timestamp.hash` format is deemed too long). + + +Implementation +============== + +Assignee(s) +----------- + +Primary assignee: + Alistair Coles (acoles) + + +Work Items +---------- + +TBD + +Repositories +------------ + +None + +Servers +------- + +None + +DNS Entries +----------- + +None + +Documentation +------------- + +No change to external API docs. Developer docs would be updated to make +developers aware of the feature. + +Security +-------- + +None + +Testing +------- + +Additional unit tests will be required for diskfile.py, object server. Probe +tests will be useful to verify replication behavior. + +Dependencies +============ + +Patch for object system metadata on PUT only: + https://review.openstack.org/#/c/79991/ + +Spec for updating containers on fast-POST: + https://review.openstack.org/#/c/102592/ + +There is a mutual dependency between this spec and the spec to update +containers on fast-POST: the latter requires content-type to be treated as +an item of mutable system metadata, which this spec aims to enable. This +spec assumes that fast-POST becomes usable, which requires consistent +container updates to be enabled. \ No newline at end of file