Retire swift-specs

Depends-On: https://review.opendev.org/732999
Change-Id: I8acff8e7c07f3e0f599d86d503eb4b088c0f8521
This commit is contained in:
Tim Burke 2020-06-02 13:46:55 -07:00
parent dded73b91d
commit c591d46d2e
63 changed files with 7 additions and 6947 deletions


@ -1,7 +0,0 @@
[run]
branch = True
source = swift-specs
omit = swift-specs/tests/*,swift-specs/openstack/*
[report]
ignore_errors = True

.gitignore vendored

@ -1,51 +0,0 @@
*.py[cod]
# C extensions
*.so
# Packages
*.egg
*.egg-info
dist
build
eggs
parts
bin
var
sdist
develop-eggs
.installed.cfg
lib
lib64
# Installer logs
pip-log.txt
# Unit test / coverage reports
.coverage
.tox
nosetests.xml
.testrepository
# Translations
*.mo
# Mr Developer
.mr.developer.cfg
.project
.pydevproject
# Complexity
output/*.html
output/*/index.html
# Sphinx
doc/build
# pbr generates these
AUTHORS
ChangeLog
# Editors
*~
.*.swp


@ -1,3 +0,0 @@
# Format is:
# <preferred e-mail> <other e-mail 1>
# <preferred e-mail> <other e-mail 2>


@ -1,7 +0,0 @@
[DEFAULT]
test_command=OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \
OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \
OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-60} \
${PYTHON:-python} -m subunit.run discover -t ./ . $LISTOPT $IDOPTION
test_id_option=--load-list $IDFILE
test_list_option=--list


@ -1,20 +0,0 @@
===========================
Contributing to swift-specs
===========================
HowToContribute
---------------
If you would like to contribute to the development of OpenStack,
you must follow the steps on this page:
http://docs.openstack.org/infra/manual/developers.html
GerritWorkFlow
--------------
Once those steps have been completed, changes to OpenStack
should be submitted for review via the Gerrit tool, following
the workflow documented at:
http://docs.openstack.org/infra/manual/developers.html#development-workflow


@ -1,3 +0,0 @@
This work is licensed under a Creative Commons Attribution 3.0 Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode


@ -1,5 +0,0 @@
include LICENCE
exclude .gitignore
exclude .gitreview
global-exclude *.pyc


@ -1,82 +1,9 @@
========================
Team and repository tags
========================
This project is no longer maintained.
.. image:: http://governance.openstack.org/badges/swift-specs.svg
:target: http://governance.openstack.org/reference/tags/index.html
The contents of this repository are still available in the Git
source code management system. To see the contents of this
repository before it reached its end of life, please check out the
previous commit with ``git checkout HEAD^1``.
.. Change things from this point on
======================
Swift Specs Repository
======================
This archive is no longer active. Content is kept for historic purposes.
========================================================================
Documents in this repo are a collection of ideas. They are not
necessarily a formal design for a feature, nor are they docs for a
feature, nor are they a roadmap for future features.
This is a git repository for doing design review on enhancements to
OpenStack Swift. This provides an ability to ensure that everyone
has signed off on the approach to solving a problem early on.
Repository Structure
====================
The structure of the repository is as follows::
specs/
done/
in_progress/
Implemented specs will be moved to :ref:`done-directory`
once the associated code has landed.
The Flow of an Idea from your Head to Implementation
====================================================
First propose a spec to the ``in_progress`` directory so that it can be
reviewed. Reviewers adding a positive +1/+2 review in gerrit are promising
that they will review the code when it is proposed. Spec documents should be
approved and merged as soon as possible, and spec documents in the
``in_progress`` directory can be updated as often as needed. Iterate on it.
#. Have an idea
#. Propose a spec
#. Reviewers review the spec. As soon as 2 core reviewers like something,
merge it. Iterate on the spec as often as needed, and keep it updated.
#. Once there is agreement on the spec, write the code.
#. As the code changes during review, keep the spec updated as needed.
#. Once the code lands (with all necessary tests and docs), the spec can be
moved to the ``done`` directory. If a feature needs a spec, it needs
docs, and the docs must land before or with the feature (not after).
Spec Lifecycle Rules
====================
#. Land quickly: A spec is a living document, and lives in the repository
not in gerrit. Potential users, ops and developers will look at
http://specs.openstack.org/openstack/swift-specs/ to get an idea of what's
being worked on, so they need to be quick to land.
#. Initial version is an idea not a technical design: That way the merits of
the idea can be discussed and landed and not stuck in gerrit limbo land.
#. Second version is an overview of the technical design: This will aid in the
technical discussions amongst the community.
#. Subsequent versions improve/enhance technical design: Each of these
versions should be relatively small patches to the spec to keep rule #1. And
keeps the spec up to date with the progress of the implementation.
How to ask questions and get clarifications about a spec
========================================================
Naturally you'll want clarifications about the way a spec is written. To ask
questions, propose a patch to the spec (via the normal patch proposal tools)
with your question or your understanding of the confusing part. That will
raise the issue in a patch review and allow everyone to answer or comment.
Learn As We Go
==============
This is a new way of attempting things, so we're going to be low in
process to begin with to figure out where we go from here. Expect some
early flexibility in evolving this effort over time.
Historical content may still be viewed at
http://specs.openstack.org/openstack/swift-specs/


@ -1,90 +0,0 @@
# -*- coding: utf-8 -*-
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import datetime
import os
import sys
sys.path.insert(0, os.path.abspath('../..'))
# -- General configuration ----------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
extensions = [
'sphinx.ext.autodoc',
'oslosphinx',
'yasfb',
]
# Feed configuration for yasfb
feed_base_url = 'http://specs.openstack.org/openstack/swift-specs'
feed_author = 'OpenStack Swift Team'
exclude_patterns = [
'**/test.rst',
'template_link.rst',
]
# Optionally allow the use of sphinxcontrib.spelling to verify the
# spelling of the documents.
try:
import sphinxcontrib.spelling
extensions.append('sphinxcontrib.spelling')
except ImportError:
pass
# autodoc generation is a bit aggressive and a nuisance when doing heavy
# text edit cycles.
# execute "export SPHINX_DEBUG=1" in your terminal to disable
# The suffix of source filenames.
source_suffix = '.rst'
# The master toctree document.
master_doc = 'index'
# General information about the project.
project = u'swift-specs'
copyright = u'%s, OpenStack Foundation' % datetime.date.today().year
# If true, '()' will be appended to :func: etc. cross-reference text.
add_function_parentheses = True
# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
add_module_names = True
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'
# -- Options for HTML output --------------------------------------------------
# The theme to use for HTML and HTML Help pages. Major themes that come with
# Sphinx are currently 'default' and 'sphinxdoc'.
# html_theme_path = ["."]
# html_theme = '_theme'
# html_static_path = ['static']
# Output file base name for HTML help builder.
htmlhelp_basename = '%sdoc' % project
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title, author, documentclass
# [howto/manual]).
latex_documents = [
('index',
'%s.tex' % project,
u'%s Documentation' % project,
u'OpenStack Foundation', 'manual'),
]


@ -1 +0,0 @@
.. include:: ../../CONTRIBUTING.rst


@ -1,41 +0,0 @@
Swift Design Specifications
===========================
This archive is no longer active. Content is kept for historic purposes.
========================================================================
Documents in this repo are a collection of ideas. They are not
necessarily a formal design for a feature, nor are they docs for a
feature, nor are they a roadmap for future features.
.. toctree::
:glob:
:maxdepth: 1
specs/in_progress/*
Specifications Repository Information
=====================================
.. toctree::
:glob:
:maxdepth: 2
*
Archived Specs
==============
.. toctree::
:glob:
:maxdepth: 1
specs/done/*
Indices and tables
==================
* :ref:`genindex`
* :ref:`search`


@ -1 +0,0 @@
.. include:: ../../README.rst


@ -1 +0,0 @@
../../specs/


@ -1 +0,0 @@
.. include:: ../../template.rst


@ -1,3 +0,0 @@
oslosphinx
sphinx>=1.1.2,<1.2
yasfb>=0.5.1


@ -1,25 +0,0 @@
[metadata]
name = swift-specs
summary = OpenStack Swift Development Specifications
description-file =
README.rst
author = OpenStack
author-email = openstack-dev@lists.openstack.org
home-page = http://www.openstack.org/
classifier =
Environment :: OpenStack
Intended Audience :: Developers
Operating System :: POSIX :: Linux
[build_sphinx]
source-dir = doc/source
build-dir = doc/build
all_files = 1
[pbr]
warnerrors = True
skip_authors = True
skip_changelog = True
[upload_sphinx]
upload-dir = doc/build/html


@ -1,22 +0,0 @@
#!/usr/bin/env python
# Copyright (c) 2013 Hewlett-Packard Development Company, L.P.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# THIS FILE IS MANAGED BY THE GLOBAL REQUIREMENTS REPO - DO NOT EDIT
import setuptools
setuptools.setup(
setup_requires=['pbr'],
pbr=True)


@ -1,15 +0,0 @@
.. _done-directory:
The ``Done`` Directory
======================
This directory in the specs repo is where specs are moved once the
associated code patch has been merged into its respective repo.
Historical Reference
--------------------
A spec document in this directory is meant only for historical
reference, it does not equate to docs for the feature. Swift's
documentation for implemented features is published
`here <http://docs.openstack.org/developer/swift/>`_.


@ -1,872 +0,0 @@
::
This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode
====================
Erasure Code Support
====================
This is a living document to be updated as the team iterates on the design;
all details here reflect current thinking, however they are subject to
change as development progresses. The team makes use of Trello to track
more real-time discussion activity that, as details/thoughts emerge, is
captured in this document.
The Trello discussion board can be found at this `link. <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_
Major remaining tasks are identified by a number that can be found in a corresponding Trello card. Outstanding
tasks are listed at the end of each section in this document. As this doc is updated and/or Trello cards are
completed, please be sure to update both places.
WIP Revision History:
* 7/25, updated meta picture, specify that object metadata is system, redo reconstructor section
* 7/31, added traceability to trello cards via section numbers and numbered task items, added a bunch of sections
* 8/5, updated middleware section, container_sync section, removed 3.3.3.7 as dup, refactoring section, create common interface to proxy nodes, partial PUT dependency on obj sysmeta patch, sync'd with trello
* 8/23, many updates to reconstructor section based on con-call from 8/22. Also added notes about not deleting on PUT where relevant and updated sections referencing closed Trello cards
* 9/4, added section in reconstructor on concurrency
* 10/7, reconstructor section updates - lots of them
* 10/14, more reconstructor section updates, 2 phase commit intro - misc typos as well from review
* 10/15, few clarifications from F2F review and bigger rewording/implementation change for what was called 2 phase commit
* 10/17, misc clarifying notes on .durable stuff
* 11/13: IMPORTANT NOTE: Several aspects of the reconstructor are being re-worked; the section will be updated ASAP
* 12/16: reconstructor updates, few minor updates throughout.
* 2/3: reconstructor updates
* 3/23: quick scrub to bring things in line w/current implementation
* 4/14: EC has been merged to master. Some parts of this spec are no longer the authority on the design; please review the code on master and the user documentation.
1. Summary
----------
EC is implemented in Swift as a Storage Policy, see `docs <http://docs.openstack.org/developer/swift/overview_policies.html>`_
for complete details on Storage Policies.
EC support impacts many of the code paths and background operations for data stored in a
container that was created with an EC policy, however this is all transparent to users of
the cluster. In addition to fully leveraging the Storage Policy framework, the EC design
will update the storage policy classes such that new policies, like EC, will be sub
classes of a generic base policy class. Major code paths (PUT/GET) are updated to
accommodate the different needs to encode/decode versus replication and a new daemon, the
EC reconstructor, performs the equivalent jobs of the replicator for replication
processes. The other major daemons remain, for the most part, unchanged as another key
concept for EC is that EC fragments (see terminology section below) are seen as regular
objects by the majority of services thus minimizing the impact on the existing code base.
The Swift code base doesn't include any of the algorithms necessary to perform the actual
encoding and decoding of data; that is left to an external library. The Storage Policies
architecture is leveraged to allow EC on a per container basis and the object rings still
provide for placement of EC data fragments. Although there are several code paths that are
unique to an operation associated with an EC policy, an external dependency to an Erasure Code
library is what Swift counts on to perform the low level EC functions. The use of an external
library allows for maximum flexibility as there are a significant number of options out there,
each with its own pros and cons that can vary greatly from one use case to another.
2. Problem description
======================
The primary aim of EC is to reduce the storage costs associated with massive amounts of data
(both operating costs and capital costs) by providing an option that maintains the same, or
better, level of durability using much less disk space. See this `study <http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/big-data-amplidata-storage-paper.pdf>`_
for more details on this claim.
EC is not intended to replace replication as it may not be appropriate for all usage models.
We expect some performance and network usage tradeoffs that will be fully characterized once
sufficient code is in place to gather empirical data. Current thinking is that what is typically
referred to as 'cold storage' is the most common/appropriate use of EC as a durability scheme.
3. Proposed change
==================
3.1 Terminology
-----------------
The term 'fragment' has been used already to describe the output of the EC process (a series of
fragments) however we need to define some other key terms here before going any deeper. Without
paying special attention to using the correct terms consistently, it is very easy to get confused
in a hurry!
* segment: not to be confused with the SLO/DLO use of the word; in EC we call a segment a series of consecutive HTTP chunks buffered up before performing an EC operation.
* fragment: data and parity 'fragments' are generated when erasure coding transformation is applied to a segment.
* EC archive: A concatenation of EC fragments; to a storage node this looks like an object
* ec_k: number of EC data fragments (k is commonly used in the EC community for this purpose)
* ec_m: number of EC parity fragments (m is commonly used in the EC community for this purpose)
* chunk: HTTP chunks received over wire (term not used to describe any EC specific operation)
* durable: original data is available (either with or without reconstruction)
* quorum: the minimum number of data + parity elements required to be able to guarantee the desired fault tolerance, which is the number of data elements supplemented by the minimum number of parity elements required by the chosen erasure coding scheme. For example, for Reed-Solomon, the minimum number of parity elements required is 1, and thus the quorum_size requirement is ec_ndata + 1. Given that the number of parity elements required is not the same for every erasure coding scheme, consult PyECLib's min_parity_fragments_needed()
* fully durable: all EC archives are written and available
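The quorum definition above can be expressed as a small helper. This is only a sketch: the per-scheme minimum parity count would come from PyECLib's min_parity_fragments_needed(); here it is passed in as a plain argument.

```python
def quorum_size(ec_ndata, min_parity):
    """Minimum number of fragment archives that must land for a PUT to
    guarantee the desired fault tolerance: all data fragments plus the
    scheme's minimum parity requirement (1 for Reed-Solomon)."""
    return ec_ndata + min_parity

# Hypothetical 10+4 Reed-Solomon policy: quorum is ec_ndata + 1 = 11.
assert quorum_size(10, 1) == 11
```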
3.2 Key Concepts
----------------
* EC is a Storage Policy with its own ring and configurable set of parameters. The # of replicas for an EC ring is the total number of data plus parity elements configured for the chosen EC scheme.
* Proxy server buffers a configurable amount of incoming data and then encodes it via PyECLib; we call this a 'segment' of an object.
* Proxy distributes the output of the encoding of a segment to the various object nodes it gets from the EC ring, we call these 'fragments' of the segment
* Each fragment carries opaque metadata for use by the PyECLib
* Object metadata is used to store meta about both the fragments and the objects
* An 'EC Archive' is what's stored on disk and is a collection of fragments appended
* The EC archive's container metadata contains information about the original object, not the EC archive
* Here is a 50K foot overview:
.. image:: images/overview.png
3.3 Major Change Areas
----------------------
**Dependencies/Requirements**
See template section at the end
3.3.1 **Storage Policy Classes**
The feature/ec branch modifies how policies are instantiated in order to
support the new EC policy.
`Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ Tasks for this section:
3.3.1.2: Make quorum a policy based function (IMPLEMENTED)
3.3.2 **Middleware**
Middleware remains unchanged. For most middleware (e.g., SLO/DLO) the fact that the
proxy is fragmenting incoming objects is transparent. For list endpoints, however, it
is a bit different. A caller of list endpoints will get back the locations of all of
the fragments. The caller will be unable to re-assemble the original object with this information,
however the node locations may still prove to be useful information for some applications.
3.3.3 **Proxy Server**
Early on it did not appear that any major refactoring would be needed
to accommodate EC in the proxy, however that doesn't mean it's not a good
opportunity to review what options might make sense right now. Discussions have included:
* should we consider a clearer line between handling incoming requests and talking to the back-end servers?
Yes, it makes sense to do this. There is a Trello card tracking this work and it is covered in a section later below.
* should the PUT path be refactored just because it's huge and hard to follow?
Opportunistic refactoring makes sense, however it's not felt that it makes sense to
combine a full refactor of PUT with this EC effort. YES! This is active WIP.
* should we consider different controllers (like an 'EC controller')?
Well, probably... YES, this is active WIP.
The following summarizes proxy changes to support EC:
*TODO: there are current discussion underway on Trello that affect both of these flows*
**Basic flow for a PUT:**
#. Proxy opens (ec_k + ec_m) backend requests to object servers
#. Proxy buffers HTTP chunks up to a minimum segment size (defined at 1MB to start with)
#. Proxy feeds the assembled segment to PyECLib's encode() to get ec_k + ec_m fragments
#. Proxy sends the (ec_k + ec_m) fragments to the object servers to be _appended_ to the previous set
#. Proxy then continues with the next set of HTTP chunks
#. Object servers store objects which are EC archives (their contents are the concatenation of erasure coded fragments)
#. Object metadata changes: for 'etag', we store the md5sum of the EC archive object, as opposed to the non-EC case where we store the md5sum of the entire object
#. Upon a quorum of responses and some minimal (2) number of commit confirmations, the proxy responds to the client
#. Upon receipt of the commit message (part of a MIME conversation) storage nodes store 0 byte data file as timestamp.durable for respective object
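The buffer-and-encode loop in the PUT flow above can be sketched as follows. This is only an illustration: toy_encode() is a stand-in for PyECLib's real encoder (naive striping plus dummy parity bytes, not actual erasure coding), and SEGMENT_SIZE mirrors the 1MB minimum segment size mentioned above.

```python
SEGMENT_SIZE = 1024 * 1024  # 1MB minimum segment size, per the spec

def toy_encode(segment, ec_k, ec_m):
    # Stand-in for PyECLib: naive striping, NOT real erasure coding.
    size = -(-len(segment) // ec_k)  # ceiling division
    data = [segment[i * size:(i + 1) * size] for i in range(ec_k)]
    parity = [b'P' * size for _ in range(ec_m)]
    return data + parity

def put_segments(http_chunks, ec_k, ec_m, segment_size=SEGMENT_SIZE):
    """Yield one (ec_k + ec_m)-fragment set per buffered segment,
    mirroring steps 2-5 of the PUT flow."""
    buf = b''
    for chunk in http_chunks:
        buf += chunk
        while len(buf) >= segment_size:
            yield toy_encode(buf[:segment_size], ec_k, ec_m)
            buf = buf[segment_size:]
    if buf:  # final, possibly short, segment
        yield toy_encode(buf, ec_k, ec_m)

# 1536 bytes with a 1024-byte segment size: two segments, each encoded
# into ec_k + ec_m = 6 fragments to be appended to the 6 archives.
sets = list(put_segments([b'x' * 1536], 4, 2, segment_size=1024))
assert len(sets) == 2 and all(len(s) == 6 for s in sets)
```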
**Proxy HTTP PUT request handling changes**
#. Intercept EC request based on policy type
#. Validate ring replica count against (ec_k + ec_m)
#. Calculate EC quorum size for min_conns
#. Call into PyEClib to encode to client_chunk_size sized object chunks to generate (ec_k + ec_m) EC fragments.
#. Queue chunk EC fragments for writing to nodes
#. Introduce Multi-phase Commit Conversation
**Basic flow for a GET:**
#. Proxy opens ec_k backend concurrent requests to object servers. See Trello card 3.3.3.3
#. Proxy 1) validates that the number of successful connections >= ec_k, 2) checks that the available fragment archives returned by the object servers are the same version, and
   3) continues searching through the handoff nodes (ec_k + ec_m) if not enough data is found. See Trello card 3.3.3.6
#. Proxy reads from the first ec_k fragment archives concurrently.
#. Proxy buffers the content to a segment up to the minimum segment size.
#. Proxy feeds the assembled segment to PyECLib's decode() to get the original content.
#. Proxy sends the original content to Client.
#. Proxy then continues with the next segment of contents.
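The read side of that flow can be sketched the same way. toy_decode() is a stand-in for PyECLib's decode() under a naive-striping assumption (the first ec_k fragments of a segment are consecutive stripes of that segment); real fragment archives also carry PyECLib metadata.

```python
def toy_decode(fragments, ec_k):
    # Stand-in for PyECLib: with naive striping, the original segment
    # is simply the concatenation of the ec_k data fragments.
    return b''.join(fragments[:ec_k])

def get_object(archives, ec_k):
    """Reassemble the original body from ec_k fragment archives, where
    each archive is a list of per-segment fragments."""
    body = b''
    # zip() walks the archives one segment at a time, mirroring the
    # proxy buffering a segment's worth of fragments before decoding.
    for segment_fragments in zip(*archives[:ec_k]):
        body += toy_decode(list(segment_fragments), ec_k)
    return body

# Three single-segment archives with ec_k=3 stripes 'ab', 'cd', 'ef'.
assert get_object([[b'ab'], [b'cd'], [b'ef']], 3) == b'abcdef'
```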
**Proxy HTTP GET request handling changes**
*TODO - add high level flow*
*Partial PUT handling*
NOTE: This is active WIP on trello.
When a previous PUT fails in the middle, for whatever reason and regardless of how the response
was sent to the client, there can be various scenarios at the object servers that require the
proxy to make some decisions about what to do. Note that because the object servers will not
return data for .data files that don't have a matching .durable file, it's not possible for
the proxy to get un-reconstructable data unless there's a combination of a partial PUT and
a rebalance going on (or handoff scenario). Here are the basic rules for the proxy when it
comes to interpreting its responses when they are mixed::
    If I have all of one timestamp, feed to PyECLib
        If PyECLib says OK
            I'm done, move on to next segment
        Else
            Fail the request (had sufficient segments but something bad happened)
    Else I have a mix of timestamps:
        Because they all have to be reconstructable, choose the newest
        Feed to PyECLib
        If PyECLib says OK
            I'm done, move on to next segment
        Else
            It's possible that the newest timestamp I chose didn't have enough segments yet
            because, although each object server claims they're reconstructable, maybe
            a rebalance or handoff situation has resulted in some of those .data files
            residing elsewhere right now. In this case, I want to look into the
            available timestamp headers that came back with the GET and see what else
            is reconstructable and go with that for now. This is really a corner case
            because we will restrict moving partitions around such that enough archives
            should be found at any given point in time, but someone might move too quickly,
            so now the next check is...
            Choose the latest available timestamp in the headers and re-issue GET
            If PyECLib says OK
                I'm done, move on to next segment
            Else
                Fail the request (had sufficient segments but something bad happened) or
                we can consider going to the next latest header...
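A minimal sketch of the timestamp-selection rule above. All names are illustrative, and reconstructable() stands in for feeding the fragments to PyECLib and seeing whether it says OK.

```python
def choose_timestamp(responses, ec_k):
    """Pick the newest timestamp whose fragments look reconstructable,
    given a mapping of timestamp -> fragments gathered from the object
    servers; return None if no timestamp qualifies (fail the request)."""
    def reconstructable(fragments):
        # Stand-in for a real PyECLib check: enough fragments to decode.
        return len(fragments) >= ec_k

    for ts in sorted(responses, reverse=True):  # newest first
        if reconstructable(responses[ts]):
            return ts  # re-issue GETs pinned to this timestamp
    return None

# Newest timestamp has too few fragments for ec_k=3; fall back to the
# older, fully reconstructable one.
assert choose_timestamp({'0002': ['f1', 'f2'],
                         '0001': ['f1', 'f2', 'f3']}, 3) == '0001'
```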
**Region Support**
For at least the initial version of EC, it is not recommended that an EC scheme span beyond a
single region. Neither performance nor functional validation has been done in such
a configuration.
`Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ Tasks for this section::
* 3.3.3.5: CLOSED
* 3.3.3.9: Multi-Phase Commit Conversation
In order to help solve the local data file cleanup problem, a multi-phase commit scheme is introduced
for EC PUT operations (last few steps above). The implementation will be via MIME documents such that
a conversation between the proxy and the storage nodes is had for every PUT. This provides us with the
ability to handle a PUT in one connection and assure that we have "the essence" of a 2 phase commit,
basically having the proxy communicate back to the storage nodes once it has confirmation that all
fragment archives in the set have been committed. Note that we still require a quorum of data elements
of the conversation to complete before signaling status to the client but we can relax that requirement
for the commit phase such that only 2 confirmations to that phase of the conversation are required for
success. More will be said about this in the reconstructor section.
Now the storage node has a cheap indicator of the last known durable set of fragment archives for a given
object on a successful durable PUT. The reconstructor will also play a role in the managing of the
.durable files, either propagating it or creating one post-reconstruction. The presence of a ts.durable
file means, to the object server, "there is a set of ts.data files that are durable at timestamp ts."
See reconstructor section for more details and use cases on .durable files. Note that the completion
of the commit phase of the conversation is also a signal for the object server to go ahead and immediately
delete older timestamp files for this object (for EC they are not immediately deleted on PUT). This is
critical as we don't want to delete the older object until the storage node has confirmation from the
proxy, via the multi-phase conversation, that the other nodes have landed enough for a quorum.
On the GET side, the implication here is that storage nodes will return the TS with a matching .durable
file even if it has a newer .data file. If there exists a .data file on one node without a .durable file but
that same timestamp has both a .data and a .durable on another node, the proxy is free to use the .durable
timestamp series as the presence of just one .durable in the set indicates that the object has integrity. In
the event that a series of .data files exists without a .durable file, they will eventually be deleted by the
reconstructor as they will be considered partial junk that is unreconstructable (recall that 2 .durables
are required for determining that a PUT was successful).
Note that the intention is that this section/trello card covers the multi-phase commit
implementation at both proxy and storage nodes however it doesn't cover the work that
the reconstructor does with the .durable file.
A few key points on the .durable file:
* the .durable file means "the matching .data file for this has sufficient fragment archives somewhere, committed, to reconstruct the object"
* the proxy server will never have knowledge (on GET or HEAD) of the existence of a .data file on an object server if it doesn't have a matching .durable file
* the object server will never return a .data that doesn't have a matching .durable
* the only component that messes with .data files that don't have matching .durable files is the reconstructor
* when a proxy does a GET, it will only receive fragment archives that have enough present somewhere to be reconstructed
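The "never return a .data without a matching .durable" rule above can be sketched as a filename scan. This is an illustration only; the real object server works through its diskfile layer, not raw listings.

```python
def newest_durable(filenames):
    """Given an object directory listing, return the newest timestamp
    whose .data file may be served, i.e. the newest timestamp that also
    has a matching .durable file; None if nothing is servable."""
    data = {f.rsplit('.', 1)[0] for f in filenames if f.endswith('.data')}
    durable = {f.rsplit('.', 1)[0] for f in filenames if f.endswith('.durable')}
    servable = data & durable
    return max(servable) if servable else None

# A newer .data exists, but only the older timestamp has a .durable,
# so the older timestamp is what gets served.
assert newest_durable(['1404894814.38776.data',
                       '1404894814.38776.durable',
                       '1404894900.00000.data']) == '1404894814.38776'
```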
3.3.3.8: Create common interface for proxy-->nodes
NOTE: This ain't gonna happen as part of the EC effort
Creating a common module that allows for abstracted access to the a/c/s nodes would not only clean up
much of the proxy IO path but would also prevent the introduction of EC from further
complicating, for example, the PUT path. Think about an interface that would let proxy code
perform generic actions to a back-end node regardless of protocol. The proposed API
should be updated here and reviewed prior to implementation and its felt that it can be done
in parallel with existing EC proxy work (no dependencies; that work is small enough it can
be merged).
3.3.3.6: Object overwrite and PUT error handling
What's needed here is a mechanism to assure that we can handle partial write failures:
a) less than a quorum of nodes is written
b) quorum is met but not all nodes were written
Note: in both cases the client will get a failure back, however without additional changes,
each storage node that saved an EC fragment archive will effectively have an orphan.
In both cases there are implications to both PUT and GET at both the proxy
and object servers. Additionally, the reconstructor plays a role here in cleaning up
old EC archives that result from the scheme described here (see the reconstructor section
for details).
**High Level Flow**
* If storing an EC archive fragment, the object server should not delete the older .data file unless it has a new one with a matching .durable.
* When the object server handles a GET, it needs to send headers to the proxy that include all available timestamps for the .data file
* If the proxy determines it can reconstruct the object with the latest timestamp (can reach quorum) it proceeds
* If quorum can't be reached, find a timestamp where quorum can be reached, kill existing connections (unless the body of that request was the found timestamp), and make new connections requesting the specific timestamp
* On GET, the object server needs to support requesting a specific timestamp (e.g. ?timestamp=XYZ)
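The last bullet might look like this on the object server side. A sketch only: the helper name is hypothetical, and the ?timestamp=XYZ parameter is taken from the flow above.

```python
def select_data_file(on_disk_timestamps, requested=None):
    """Pick which .data timestamp to serve for a GET: the explicitly
    requested one (from ?timestamp=XYZ) if present on disk, otherwise
    the newest available; None means respond 404."""
    if requested is not None:
        return requested if requested in on_disk_timestamps else None
    return max(on_disk_timestamps) if on_disk_timestamps else None

# A pinned GET serves the older timestamp even when a newer one exists.
assert select_data_file(['0001', '0002'], requested='0001') == '0001'
```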
`Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ Tasks for this section::
* 3.3.3.1: CLOSED
* 3.3.3.2: Add high level GET flow
* 3.3.3.3: Concurrent connects to object server on GET path in proxy server
* 3.3.3.4: CLOSED
* 3.3.3.5: Region support for EC
* 3.3.3.6 EC PUTs should not delete old data files (in review)
* 3.3.3.7: CLOSED
* 3.3.3.8: Create common interface for proxy-->nodes
* 3.3.3.9: Multi-Phase Commit Conversation
3.3.4 **Object Server**
TODO - add high level flow
`Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ Tasks for this section::
* 3.3.4.1: Add high level Obj Serv modifications
* 3.3.4.2: Add trailer support (affects proxy too)
3.3.5 **Metadata**
NOTE: Some of these metadata names are different in the code...
Additional metadata is part of the EC design in a few different areas:
* New metadata is introduced in each 'fragment' that is opaque to Swift, it is used by PyECLib for internal purposes.
* New metadata is introduced as system object metadata as shown in this picture:
.. image:: images/meta.png
The object metadata will need to be stored as system metadata.
`Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ Tasks for this section::
* 5.1: Enable sysmeta on object PUT (IMPLEMENTED)
3.3.6 **Database Updates**
We don't need/want container updates to be sent out by every storage node
participating in the EC set, and in fact that is exactly how it will work
without any additional changes; see _backend_requests() in the proxy
PUT path for details.
3.3.7 **The Reconstructor**
**Overview**
The key concepts in the reconstructor design are:
*Focus on use cases that occur most frequently:*
#. Recovery from disk drive failure
#. Rebalance
#. Ring changes and revertible handoff case
#. Bit rot
* Reconstruction happens at the EC archive level (no visibility into fragment level for either auditing or reconstruction)
* Highly leverage ssync to gain visibility into which EC archive(s) are needed (some ssync mods needed; consider renaming the verb REPLICATION since ssync can be syncing in different ways now)
* Minimal changes to existing replicator framework, auditor, ssync
* Implement as new reconstructor daemon (much reuse from replicator) as there will be some differences and we will want separate logging and daemon control/visibility for the reconstructor
* Nodes in the list only act on their neighbors with regards to reconstruction (nodes don't talk to all other nodes)
* Once a set of EC archives has been placed, the ordering/matching of the fragment index to the index of the node in the primary partition list must be maintained for handoff node usage
* EC archives are stored with their fragment index encoded in the filename
**Reconstructor framework**
The current implementation thinking has the reconstructor live as its own daemon so
that it has independent logging and controls. Its structure borrows heavily from
the replicator.
The reconstructor will need to do a few things differently than the replicator,
above and beyond the obvious EC functions. The major differences are:
* there is no longer the concept of 2 job processors that either sync or revert, instead there is a job pre-processor that figures out what needs to be done and one job processor carries out the actions needed
* syncs only with nodes to the left and right on the partition list (not with all nodes)
* for reversion, syncs with as many nodes as needed as determined by the fragment indexes that it is holding; the number of nodes will be equivalent to the number of unique fragment indexes that it is holding. It will use those indexes as indexes into the primary node list to determine which nodes to sync to.
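The reversion rule in the last bullet can be sketched as a simple mapping; ``revert_targets`` is a hypothetical helper and the node list is illustrative:

```python
def revert_targets(frag_indexes_held, primary_nodes):
    """Map each unique fragment index held to the primary node at that
    position in the partition's node list -- the node to revert it to."""
    return {fi: primary_nodes[fi] for fi in set(frag_indexes_held)}
```

A handoff holding fragment indexes 0 and 2 would thus sync with exactly two primaries, the ones at positions 0 and 2.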
**Node/Index Pairing**
The following are some scenarios that help explain why the node/fragment index pairing is so important for both of the operations just mentioned.
.. image:: images/handoff1.png
Next Scenario:
.. image:: images/handoff2.png
**Fragment Index Filename Encoding**
Each storage policy now must include a transformation function that diskfile will use to build the
filename to store on disk. This is required by the reconstructor for a few reasons. For one, it
allows us to store fragment archives of different indexes on the same storage node. This does not
happen in the happy path but is possible in some circumstances. Without unique filenames for
the different EC archive files in a set, we would be at risk of overwriting one archive of index
n with another of index m in some scenarios.
The transformation function for the replication policy is simply a NOP. For reconstruction, the index
is appended to the filename just before the .data extension. An example filename for a fragment
archive storing the 5th fragment would look like this::
1418673556.92690#5.data
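The encoding above can be sketched with a pair of helpers. These are illustrative assumptions, not the actual diskfile implementation:

```python
import os

def encode_frag_filename(timestamp, frag_index):
    """Append the fragment index after '#', just before .data."""
    return '%s#%d.data' % (timestamp, frag_index)

def parse_frag_filename(filename):
    """Recover (timestamp, fragment index) from an EC archive filename."""
    base, _ext = os.path.splitext(filename)
    timestamp, _, frag = base.partition('#')
    return timestamp, int(frag)
```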
**Diskfile Refactoring**
In order to more cleanly accommodate some of the low level on disk storage needs of EC (file names, .durable, etc.)
diskfile has some additional layering introduced allowing those functions that need EC specific changes to be
isolated. TODO: Add detail here.
**Reconstructor Job Pre-processing**
Because any given suffix directory may contain more than one fragment archive index data file,
the actions that the reconstructor needs to take are not as simple as either syncing or reverting
data as is done with the replicator. Because of this, it is more efficient for the reconstructor
to analyze what needs to be done on a per part/suffix/fragment index basis and then schedule a
series of jobs that are executed by a single job processor (as opposed to the clear-cut scenarios
of sync and revert as with the replicator). The main scenarios that the pre-processor is
looking at:
#) part dir with all FIs matching the local node index: this is the case where everything is where it belongs and we just need to compare hashes and sync if needed; here we sync with our partners
#) part dir with one local FI and a mix of others: here we need to sync with our partners where the FI matches the local index; all others are synced with their home nodes and then killed
#) part dir with no local FI and just one or more others: here we sync with just the FIs that exist, nobody else, and then all the local FAs are killed
So the main elements of a job that the job processor is handed include a list of exactly who to talk
to, which suffix dirs are out of sync and which fragment index to care about. Additionally the job
includes information used by both ssync and the reconstructor to delete, as required, .data files on
the source node as needed. Basically the work done by the job processor is a hybrid of what the
replicator does in update() and update_deleted().
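The three scenarios above can be distilled into a small classifier; this is a hypothetical sketch of the pre-processor's decision, not the actual implementation:

```python
def classify_part_dir(frag_indexes, local_index):
    """Decide what the job pre-processor should schedule for one part dir,
    given the set of fragment indexes found and this node's primary index."""
    frag_indexes = set(frag_indexes)
    if frag_indexes == {local_index}:
        return 'sync_with_partners'      # everything is where it belongs
    if local_index in frag_indexes:
        return 'sync_local_revert_rest'  # sync local FI, revert the others
    if frag_indexes:
        return 'revert_all'              # no local FI: push home, then delete
    return 'nothing_to_do'
```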
**The Act of Reconstruction**
Reconstruction can be thought of as being like replication but with an extra step
in the middle. The reconstructor is hard-wired to use ssync to determine what
is missing and desired by the other side; however, before an object is sent over the
wire it needs to be reconstructed from the remaining fragments, as the local
fragment is just that - a different fragment index than what the other end is
asking for.
Thus there are hooks in ssync for EC based policies. One case would be for
basic reconstruction which, at a high level, looks like this:
* ask PyECLib which nodes need to be contacted to collect other EC archives needed to perform reconstruction
* establish a connection to the target nodes and give ssync a DiskFileLike class that it can stream data from. The reader in this class will gather fragments from the nodes and use PyECLib to rebuild each segment before yielding data back to ssync
Essentially what this means is that data is buffered, in memory, on a per segment basis
at the node performing reconstruction and each segment is dynamically reconstructed and
delivered to ssync_sender where the send_put() method will ship them on over.
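At a high level, the per-segment rebuild loop might look like the sketch below; ``ec_driver`` stands in for PyECLib, peer fetching is faked as in-memory fragment lists, and all names here are assumptions for illustration:

```python
def reconstruct_stream(peer_fragment_iters, ec_driver, missing_index):
    """Yield rebuilt segments one at a time: take the i-th fragment from
    each peer, then ask the EC driver to rebuild the missing fragment
    index for that segment before handing it to the sender."""
    for segment_fragments in zip(*peer_fragment_iters):
        yield ec_driver.reconstruct(list(segment_fragments), [missing_index])
```

Because only one segment's worth of fragments is held at a time, memory use stays bounded regardless of object size.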
The following picture shows the ssync changes needed to enable reconstruction. Note that
there are several implementation details not covered here having to do with things like
making sure that the correct fragment archive indexes are used, getting the metadata
correctly setup for the reconstructed object, deleting files/suffix dirs as needed
after reversion, etc., etc.
.. image:: images/recon.png
**Reconstructor local data file cleanup**
NOTE: This section is outdated, needs to be scrubbed. Do not read...
For the reconstructor, cleanup is a bit different than replication because, for PUT consistency
reasons, the object server is going to keep the previous .data file (if it existed) just
in case the PUT of the most recent didn't complete successfully on a quorum of nodes. That
leaves the reconstructor with many scenarios to deal with when it comes to cleaning up old files:
a) Assuming a PUT worked (commit received), the reconstructor will need to delete the older
timestamps on the local node. This can be detected locally by examining the TS.data and
TS.durable filenames. Any TS.data that is older than TS.durable can be deleted.
b) Assuming a quorum or better and the .durable file didn't make it to some nodes, the reconstructor
will detect this (different hashes, further examination shows presence of local .durable file and
remote matching ts files but not remote .durable) and simply push the .durable file to the remote
node, basically replicating it.
c) In the event that a PUT was only partially complete but was still able to get a quorum down,
the reconstructor will first need to reconstruct the object and then push the EC archives out
such that all participating nodes have one, then it can delete the older timestamps on the local
node. Once the object is reconstructed, a TS.durable file is created and committed such that
each storage node has a record of the latest durable set much in the same way the multi-phase commit
works in PUT.
d) In the event that a PUT was only partially complete and did not get a quorum,
reconstruction is not possible. The reconstructor therefore needs to delete these files
but there also must be an age factor to prevent it from deleting in flight PUTs. This should be
the default behavior but should be able to be overridden in the event that an admin may want
partials kept for some reason (easier DR maybe). Regardless, logging when this happens makes a
lot of sense. This scenario can be detected when the reconstructor attempts to reconstruct
because it notices it does not have a TS.durable for a particular TS.data and gets enough 409s
that it can't feed PyECLib enough data to reconstruct (it will need to feed PyECLib what it gets
and PyECLib will tell it if there's not enough, though). Whether we delete the .data file or mark it
somehow so we don't keep trying to reconstruct is TBD.
**Reconstructor rebalance**
Current thinking is that there should be no special handling here above and beyond the changes
described in the handoff reversion section.
**Reconstructor concurrency**
There are 2 aspects of concurrency to consider with the reconstructor:
1) concurrency of the daemon
This means the same for the reconstructor as it does for the replicator, the
size of the GreenPool used for the 'update' and 'update_deleted' jobs.
2) overall parallelism of partition reconstruction
With regards to node-node communication we have already covered the notion that
the reconstructor cannot simply check in with its neighbors to determine what
action it should take, if any, on its current run because it needs to know the
status of the full stripe (not just the status of one or two other EC archives).
However, we do not want it to actually take action on all other nodes. In other
words, we do want to check in with every node to see if a reconstruction is needed,
but in the event that it is, we only want reconstruction attempted by the partner
nodes, its left and right neighbors. This will minimize reconstruction races but
still provide for redundancy in addressing the reconstruction of an EC archive.
In the event that a node (HDD) is down, there will be 2 partners for that node per
partition working the reconstruction. Thus if we had 6 primaries, for example,
and an HDD dies on node 1, we only want nodes 0 and 2 to add jobs to their local
reconstructor even though when they call obj_ring.get_part_nodes(int(partition))
to get a list of other members of the stripe they will get back 6 nodes. The local
node will make its decision as to whether to add a reconstruction job or not based
on its position in the node list.
In doing this, we minimize the reconstruction races but still enable all 6 nodes to be
working on reconstruction for a failed HDD as the partitions will be distributed
amongst all of the nodes therefore the node with the dead HDD will potentially have
all other nodes pushing reconstructed EC archives to the handoff node in parallel on
different partitions with every partition having at most 2 nodes racing to reconstruct
its archives.
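The partner rule used throughout this example can be sketched as follows; this is a hypothetical helper, with indexes wrapping around the primary node list:

```python
def partners(local_index, num_primaries):
    """Return the primary-list indexes of the left and right neighbours
    of the given node -- the only peers it reconstructs on behalf of."""
    return [(local_index - 1) % num_primaries,
            (local_index + 1) % num_primaries]
```

In the example above, only nodes 0 and 2 (the partners of node 1) schedule reconstruction jobs for node 1's dead drive.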
The following picture illustrates the example above.
.. image:: images/recons_ex1.png
**SCENARIOS:**
The following series of pictures illustrate the various scenarios more completely. We will use
these scenarios against each of the main functions of the reconstructor which we will define as:
#. Reconstructor framework (daemon)
#. Reconstruction (Ssync changes per spec sequence diagram)
#. Reconstructor local data file cleanup
#. Rebalance
#. Handoff reversion (move data back to primary)
*TODO: Once designs are proposed for each of the main areas above, map to scenarios below for completeness.*
.. image:: images/recons1.png
.. image:: images/recons2.png
.. image:: images/recons3.png
.. image:: images/recons4.png
.. image:: images/recons5.png
.. image:: images/recons6.png
.. image:: images/recons7.png
.. image:: images/recons8.png
.. image:: images/recons9.png
.. image:: images/recons10.png
`Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ Tasks for this section::
* 3.3.7.1: Reconstructor framework
* 3.3.7.2: Ssync changes per spec sequence diagram
* 3.3.7.3: Reconstructor local data file cleanup
* 3.3.7.4: Node to node communication and synchronization on stripe status
* 3.3.7.5: Reconstructor rebalance
* 3.3.7.6: Reconstructor handoff reversion
* 3.3.7.7: Add conf file option to never delete un-reconstructable EC archives
3.3.8 **Auditor**
Because the auditor already operates on a per storage policy basis, there are no specific
auditor changes associated with EC. Each EC archive looks like, and is treated like, a
regular object from the perspective of the auditor. Therefore, if the auditor finds bit-rot
in an EC archive, it simply quarantines it and the EC reconstructor will take care of the rest
just as the replicator does for replication policies. Because quarantine directories are
already isolated per policy, EC archives have their own quarantine directories.
3.3.9 **Performance**
Lots of considerations, planning, testing, tweaking, discussions, etc., etc. to do here
`Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ Tasks for this section::
* 3.3.9.1: Performance Analysis
3.3.10 **The Ring**
I think the only real thing to do here is make rebalance able to move more than 1 replica of a
given partition at a time. In my mind, the EC scheme is stored in swift.conf, not in the ring,
and the placement and device management doesn't need any changes to cope with EC.
We also want to scrub ring tools to use the word "node" instead of "replicas" to avoid
confusion with EC.
`Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ Tasks for this section::
* 3.3.10.1: Ring changes
3.3.11 **Testing**
Since these tests aren't always obvious (or possible) on a per patch basis (because of
dependencies on other patches) we need to document scenarios that we want to make sure
are covered once the code supports them.
3.3.11.1 **Probe Tests**
The `Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ card for this has a good
starting list of test scenarios, more should be added as the design progresses.
3.3.11.2 **Functional Tests**
To begin with, at least, it is believed we just need to make an EC policy the default
and run the existing functional tests (and make sure that happens automatically)
`Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ Tasks for this section::
* 3.3.11.1: Required probe test scenarios
* 3.3.11.2: Required functional test scenarios
3.3.12 **Container Sync**
Container sync assumes the use of replicas. In the current design, container sync from an EC
policy would send only one fragment archive to the remote container, not the reconstructed object.
Therefore container sync needs to be updated to use an internal client instead of the direct client
that would only grab a fragment archive.
`Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ Tasks for this section::
* 3.3.12.1: Container sync from EC containers
3.3.13 **EC Configuration Helper Tool**
Script to include w/Swift to help determine what the best EC scheme might be and what the
parameters should be for swift.conf.
`Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ Tasks for this section::
* 3.3.13.1: EC Configuration Helper Tool
3.3.14 **SAIO Updates**
We want to make sure it's easy for the SAIO environment to be used for EC development
and experimentation. Just as we did with policies, we'll want to update both docs
and scripts once we decide exactly what we want it to look like.
For now let's start with 8 total nodes (4 servers) and a 4+2+2 scheme (4 data, 2 parity, 2 handoffs)
`Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ Tasks for this section::
* 3.3.14.1: SAIO Updates (IMPLEMENTED)
3.4 Alternatives
----------------
This design is 'proxy centric' meaning that all EC is done 'in line' as we bring data in/out of
the cluster. An alternate design might be 'storage node centric' where the proxy is really
unaware of EC work and new daemons move data from 3x to EC schemes based on rules that could
include factors such as age and size of the object. There was a significant amount of discussion
on the two options but the former was eventually chosen for the following main reasons:
* EC is CPU/memory intensive and being 'proxy centric' more closely aligns with how providers are planning/have deployed their HW infrastructure
* Having more intelligence at the proxy and less at the storage node is more closely aligned with general Swift architectural principles
* The latter approach was limited to 'off line' EC, meaning that data would always have to make the 'trip' through replication before becoming erasure coded, which is not as usable for many applications
* The former approach provides for 'in line' as well as 'off line' by allowing the application to store data in a replication policy first and then move that data at some point later to EC by copying the data to a different container. There are thoughts/ideas for alternate means of allowing the policy of a container to change that are not covered here but are recognized to be possible with this scheme, making it even easier for an application to control the data durability policy.
*Alternate Reconstructor Design*
An alternate, but rejected, proposal is archived on `Trello. <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_
Key concepts for the REJECTED proposal were:
* Perform auditing at the fragment level (sub-segment) to avoid having the smallest unit of work be an EC archive. This will reduce reconstruction network traffic. Today the auditor quarantines an entire object; for fragment-level rebuild we need an additional step to identify which fragment within the archive is bad and potentially quarantine it in a different location to protect the archive from deletion until the reconstructor is done with it. Today hashes.pkl only identifies a suffix directory in need of attention. For fragment-level rebuild, the reconstructor needs additional information as it's not just syncing at the directory level:
  * Needs to know which fragment archive in the suffix dir needs work
  * Needs to know which segment index within the archive is bad
  * Needs to know the fragment index of the archive (the EC archive's position within the set)
* Perform reconstruction on the local node, however preserve the push model by having the remote node communicate reconstruction information via a new verb. This will reduce reconstruction network traffic. This could be really bad wrt overloading the local node with reconstruction traffic, as opposed to using the compute power of all systems participating in the partitions kept on the local node.
*Alternate Reconstructor Design #2*
The design proposal leverages the REPLICATE verb but introduces a new hashes.pkl format
for EC and, for readability, names this file ec_hashes.pkl. The contents of this file will be
covered shortly, but it essentially needs to contain everything that any node would need to know
in order to make a pass over its data and decide whether to reconstruct, delete, or move data.
So, for EC, the standard hashes.pkl file and/or functions that operate on it are not relevant.
The data in ec_hashes.pkl has the following properties:
* needs to be synchronized across all nodes
* needs to have complete information about any given object hash to be valid for that hash
* can be complete for some object hashes and incomplete for others
There are many choices for achieving this ranging from gossip methods to consensus schemes. The
proposed design leverages the fact that all nodes have access to a common structure and accessor
functions that are assumed to be synchronized (eventually) such that any node position in the list
can be used to select a master for one of two operations that require node-node communication:
(1) ec_hashes.pkl synchronization and (2) reconstruction.
*ec_hashes.pkl synchronization*
At any given point in time there will be one node out of the set of nodes returned from
get_part_nodes() that will act as the master for synchronizing ec_hashes.pkl information. The
reconstructor, at the start of each pass, will use a bully-style algorithm to elect the hash master.
When each reconstructor starts a pass it will send an election message to all nodes with a node
index lower than its own. If unable to connect with said nodes then it assumes the role of
hash master. If any nodes with a lower index reply then it continues with the current pass,
processing its objects based on current information in its ec_hashes.pkl. This bully-like
algorithm won't actually prevent 2 masters from running at the same time (for example nodes 0-2
could all be down so node 3 starts as master; then one of the nodes comes back up and will
also start the hash synchronization process). Note that this does not cause functional issues;
it's just a bit wasteful, but it saves us from implementing a more complex consensus algorithm
that's not deemed to be worth the effort.
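The election rule can be sketched in a few lines; ``ping`` is an assumed callable standing in for the election message, and this is an illustration of the rule rather than proposed code:

```python
def is_hash_master(my_index, ping):
    """A node assumes the hash-master role only when every node with a
    lower index in the primary list fails to answer the election message."""
    return not any(ping(i) for i in range(my_index))
```

Node 0 is always master when up; if nodes 0-2 are down, node 3 wins the election, which is exactly the dual-master window described above.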
The role of the master will be to:
#. send REPLICATE to all other nodes in the set
#. merge results
#. send new variation of REPLICATE to all other nodes
#. nodes merge into their ec_hashes.pkl
In this manner there will typically be one node sending 2 REPLICATE verbs to n other nodes
for each pass of the reconstructor so a total of 2(n-1) REPLICATE so O(n) versus O(1) for
replication where 3 nodes would be sending 2 messages each for a constant 6 messages per
pass. Note that there are distinct differences between the merging done by the master
after collecting node pkl files and the merging done at the nodes after receiving the
master version. When the master is merging, it is only updating the master copy with
new information about the sending node. When a node is merging from master, it is only
updating information about all other nodes. In other words, the master is only interested
in hearing information from a node about that node itself and any given node is only
interested in learning about everybody else. More on these merging rules later.
At any given point in time the ec_hashes.pkl file on a node can be in a variety of states; it
is not required that, although a synchronized set was sent by the master, the synchronized
version be inspected by participating nodes. Each object hash within the ec_hashes.pkl will
have information indicating whether that particular entry is synchronized or not; therefore it
may be the case that a particular pass of a reconstructor run parses an ec_hashes.pkl file and
finds only some percentage N of synchronized entries, where N started at 100% and dropped from there
as changes were made to the local node (objects added, objects quarantined). An example will
be provided after defining the format of the file.
ec_hashes data structure::

    {object_hash_0: {TS_0: [node0, node1, ...], TS_n: [node0, node1, ...], ...},
     object_hash_1: {TS_0: [node0, node1, ...], TS_n: [node0, node1, ...], ...},
     object_hash_n: {TS_0: [node0, node1, ...], TS_n: [node0, node1, ...], ...}}
where nodeX takes on values of unknown, not present or present, such that a reconstructor
parsing its local structure can determine, on an object by object basis, which TS files
exist on which nodes, which ones it is missing, or whether it has incomplete information for
that TS (a node value for that TS is marked as unknown). Note that although this file format
will contain per object information, objects are removed from the file by the local node
once it has *seen* information from all other nodes for that entry. Therefore
the file will not contain an entry for every object in the system but instead a transient
entry for every object while it's being accepted into the system (having its consistency wrt
EC verified).
The new ec_hashes.pkl is subject to several potential writers including the hash master,
its own local reconstructor, the auditor, the PUT path, etc., and will therefore be using
the same locking that hashes.pkl uses today. The following illustrates the ongoing
updates to ec_hashes.pkl
.. image:: images/ec_pkl_life.png
As the ec_hashes.pkl file is updated, the following rules apply:
As a **hash master** updating a local master file with any single node file:
(recall the goal here is to update the master with info about the incoming node)
* data is never deleted (i.e. if an object hash or TS key exists in master but does not in the incoming dictionary, the entry is left intact)
* data can be added (if an object hash or TS key exists in an incoming dictionary but does not exist in master it is added)
* where keys match, only the node index in the TS list for the incoming data is affected and that data is replaced in master with the incoming information
As a **non-master** node merging from the master:
(recall that the goal here is to have this node learn the other nodes in the cluster)
* an object hash is deleted as soon as all nodes are marked present
* data can be added, same as above
* where keys match, only the *other* indices in the TS list for the incoming data are affected and that data is replaced with the incoming information
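The two merge rules can be sketched over the structure defined earlier. States are the strings used in this spec ('unknown', 'not present', 'present'); the code is an illustrative assumption, not the proposed implementation:

```python
def master_merge(master, node_hashes, node_index):
    """Hash master merging one node's file: learn only about the sender
    itself; never delete anything."""
    for oh, ts_map in node_hashes.items():
        for ts, nodes in ts_map.items():
            entry = master.setdefault(oh, {}).setdefault(
                ts, ['unknown'] * len(nodes))
            entry[node_index] = nodes[node_index]

def node_merge(local, master_hashes, node_index):
    """Node merging the master's file: learn only about *other* nodes;
    drop hashes once every node is marked present."""
    for oh, ts_map in master_hashes.items():
        for ts, nodes in ts_map.items():
            entry = local.setdefault(oh, {}).setdefault(
                ts, ['unknown'] * len(nodes))
            entry[:] = [entry[i] if i == node_index else nodes[i]
                        for i in range(len(nodes))]
    complete = [oh for oh, ts_map in local.items()
                if all(s == 'present'
                       for states in ts_map.values() for s in states)]
    for oh in complete:
        del local[oh]
```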
**Some examples**
The following are some example scenarios (used later to help explain use cases) and their
corresponding ec_hashes data structures.
.. image:: images/echash1.png
.. image:: images/echash2.png
4. Implementation
=================
Assignee(s)
-----------
There are several key contributors; torgomatic is the core sponsor
Work Items
----------
See `Trello discussion board <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_
Repositories
------------
Using Swift repo
Servers
-------
N/A
DNS Entries
-----------
N/A
5. Dependencies
===============
As mentioned earlier, the EC algorithms themselves are implemented externally in
multiple libraries. See the main site for the external work at `PyECLib <https://bitbucket.org/kmgreen2/pyeclib>`_
PyECLib itself is already an accepted `requirement. <https://review.openstack.org/#/c/76068/>`_
Work is ongoing to sort out additional package dependencies for PyECLib.
There is a Linux package, liberasurecode, that is also being developed as part of this effort
and is needed by PyECLib. Getting it added for devstack tempest tests and unittest slaves is
currently WIP by tsg
`Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ Tasks for this section::
* 5.1: Enable sysmeta on object PUT (IMPLEMENTED)
::
This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode
=====================================
Composite Tokens and Service Accounts
=====================================
This is a proposal for how Composite Tokens can be used by services such
as Glance and Cinder to store objects in project-specific accounts yet
retain control over how those objects are accessed.
This proposal uses the "Service Token Composite Authorization" support in
the auth_token Keystone middleware
(http://git.openstack.org/cgit/openstack/keystone-specs/plain/specs/keystonemiddleware/service-tokens.rst).
Problem Description
===================
Swift is used by many OpenStack services to store data on behalf of users.
There are typically two approaches to where the data is stored:
* *Single-project*. Objects are stored in a single dedicated Swift account
(i.e., all data belonging to all users is stored in the same account).
* *Multi-project*. Objects are stored in the end-user's Swift account (project).
Typically, dedicated container(s) are created to hold the objects.
There are advantages and limitations with both approaches as described in the
following table:
==== ========================================== ========== ========
Item Feature/Topic Single- Multi-
Project Project
---- ------------------------------------------ ---------- --------
1 Fragile to password leak (CVE-2013-1840) Yes No
2 Fragile to token leak Yes No
3 Fragile to container deletion Yes No
4 Fragile to service user deletion Yes No
5 "Noise" in Swift account No Yes
6 Namespace collisions (user and service No Yes
picking same name)
7 Guarantee of consistency (service Yes No
database vs swift account)
8 Policy enforcement (e.g., Image Download) Yes No
==== ========================================== ========== ========
Proposed change
===============
It is proposed to put service data into an account separate from the end-user's
"normal" account. Although the account has a different name, the account
is linked to the end-user's project. This solves issues with noise
and namespace collisions. To remove fragility and improve consistency
guarantees, it is proposed to use the composite token feature to manage
access to this account.
In summary, there are three related changes:
* Support for Composite Tokens
* The authorization logic can require authenticated information from the
composite tokens
* Support for multiple reseller prefixes, each with their own configuration
The effect is that access to the data must be made through the service.
In addition, the service can only access the data when it is processing
a request from the end-user (i.e, when it has an end-user's token).
The changes are described one by one in this document. The impatient can
skip to "Composite Tokens in the OpenStack Environment" for a complete
example.
Composite Tokens
================
The authentication system will validate a second token. The token is stored
in the X-Service-Token header, so it is known as the service token (name chosen
by Keystone).
The core function of the token authentication scheme is to determine who the
user is, what account is being accessed and what roles apply. Keystoneauth
and Tempauth have slightly different semantics, so the tokens are combined
in slightly different ways as explained in the following sections.
Combining Roles in Keystoneauth
-------------------------------
The following rules are used when a service token is present:
* The user_id is the user_id from the first token (i.e., no change)
* The account (project) is specified by the first token (i.e., no change)
* The user roles are initially determined by the first token (i.e., no change).
* The roles from the service token are made available in service_roles.
Example 1 - Combining Roles in keystoneauth
-------------------------------------------
In this example, the <token-two> is scoped to a different project than
the account/project being accessed::
Client
| <user-token>: project-id: 1234
| user-id: 9876
| roles: admin
| X-Auth-Token: <user-token>
| X-Service-Token: <token-two>
|
| <token-two>: project-id: 5678
v user-id: 5432
Swift roles: service
|
v
Combined identity information:
user_id: 9876
project_id: 1234
roles: admin
service_roles: service
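As an illustration only (the dict keys here are hypothetical, not the actual middleware internals), the combination rules above could be sketched as:

```python
# Hypothetical sketch of the keystoneauth combination rules: identity is
# taken from the first (user) token; only roles come from the service token.
def combine_tokens(user_token, service_token=None):
    identity = {
        'user_id': user_token['user_id'],        # always from the first token
        'project_id': user_token['project_id'],  # account from the first token
        'roles': list(user_token['roles']),      # user roles from the first token
        'service_roles': [],
    }
    if service_token is not None:
        # only the roles of the service token are used
        identity['service_roles'] = list(service_token['roles'])
    return identity
```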
Combining Groups in Tempauth
----------------------------
The user groups from both tokens are simply combined into one list. The
following diagram gives an example of this::
Client
| <user-token>: from "joe"
|
|
| X-Auth-Token: <user-token>
| X-Service-Token: <token-two>
|
| <token-two>: from "glance"
v
Swift
|
| [filter:tempauth]
| user_joesaccount_joe = joespassword .admin
| user_glanceaccount_glance = glancepassword servicegroup
|
v
Combined Groups: .admin servicegroup
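Again as an illustration only (not Swift code), the tempauth combination amounts to a simple concatenation:

```python
# Sketch of the tempauth rule above: groups granted by X-Auth-Token and
# X-Service-Token are simply combined into one list.
def combine_groups(user_groups, service_groups):
    return list(user_groups) + list(service_groups)
```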
Support for multiple reseller prefixes
======================================
The reseller_prefix will now support a list of prefixes. For example,
the following supports both ``AUTH_`` and ``SERVICE_`` in keystoneauth::
[filter:keystoneauth]
reseller_prefix = AUTH_, SERVICE_
For backward compatibility, the default remains as ``AUTH_``.
All existing configuration options are assumed to apply to the first
item in the list. However, to indicate which prefix an option applies to,
put the prefix in front of the option name. This applies to the
following options:
* operator_roles (keystoneauth)
* service_roles (described below) (keystoneauth)
* require_group (described below) (tempauth)
Other options (logging, storage_url_scheme, etc.) are not specific to
the reseller prefix.
For example, this shows two prefixes and some options::
[filter:keystoneauth]
reseller_prefix = AUTH_, SERVICE_
reseller_admin_role = ResellerAdmin <= global, applies to all
AUTH_operator_roles = admin <= new style
SERVICE_operator_roles = admin
allow_overrides = false
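The prefix/option resolution described above could be sketched as follows. This is an illustrative parsing helper, not the proposed implementation; only the option names follow this spec:

```python
# Parse a multi-prefix reseller configuration: per-prefix options are named
# <PREFIX><option>; for backward compatibility, the unprefixed option name
# applies to the first prefix in the list only.
def parse_prefixed_options(conf, option='operator_roles'):
    prefixes = [p.strip() for p in
                conf.get('reseller_prefix', 'AUTH_').split(',') if p.strip()]
    options = {}
    for prefix in prefixes:
        key = prefix + option
        if key in conf:
            options[prefix] = conf[key].split()
        elif prefix == prefixes[0] and option in conf:
            # old-style unprefixed option applies to the first prefix
            options[prefix] = conf[option].split()
        else:
            options[prefix] = []
    return prefixes, options
```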
Support for composite authorization
===================================
We will add an option called "service_roles" to keystoneauth. If
present, composite tokens must be used and the service_roles must contain the
listed roles. Here is an example where the ``AUTH_`` namespace requires the
"admin" role be associated with the X-Auth-Token. The ``SERVICE_`` namespace
requires that the "admin" role be associated with X-Auth-Token. In
addition, it requires that the "service" role be associated with
X-Service-Token::
[filter:keystoneauth]
reseller_prefix = AUTH_, SERVICE_
AUTH_operator_roles = admin
SERVICE_operator_roles = admin
SERVICE_service_roles = service
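A minimal sketch of this composite check, assuming the combined identity described earlier (the function and parameter names are illustrative, not keystoneauth's internals):

```python
# The caller must hold an operator role via X-Auth-Token AND, when
# service_roles is configured, every listed role via X-Service-Token.
def is_swift_owner(roles, service_roles,
                   operator_roles=('admin',),
                   required_service_roles=('service',)):
    if not set(operator_roles) & set(roles):
        return False
    return set(required_service_roles) <= set(service_roles)
```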
In tempauth, we will add an option called "require_group". If present,
the user or service user must be a member of this group. (Since tempauth
combines groups from both X-Auth-Token and X-Service-Token, the required
group may come from either or both tokens.)
The following shows an example::
[filter:tempauth]
reseller_prefix = AUTH_, SERVICE_
SERVICE_require_group = servicegroup
Composite Tokens in the OpenStack Environment
=============================================
This section presents a simple configuration showing the flow from client
through an OpenStack Service to Swift. We use Glance in this example, but
the principle is the same for all services. See later for a more
complex service-specific setup.
The flow is as follows::
Client
| <user-token>: project-id: 1234
| user-id: 9876
| (request) roles: admin
| X-Auth-Token: <user-token>
|
v
Glance
|
| PUT /v1/SERVICE_1234/container/object
| X-Auth-Token: <user-token>
| X-Service-Token: <glance-token>
|
| <glance-token>: project-id: 5678
v user-id: 5432
Swift roles: service
|
v
Combined identity information:
user_id: 9876
project-id: 1234
roles: admin
service_roles: service
[filter:keystoneauth]
reseller_prefix = AUTH_, SERVICE_
AUTH_operator_roles = admin
AUTH_reseller_admin_roles = ResellerAdmin
SERVICE_operator_roles = admin
SERVICE_service_roles = service
SERVICE_reseller_admin_roles = ResellerAdmin
The authorization logic is as follows::
/v1/SERVICE_1234/container/object
-------
|
in?
|
reseller_prefix = AUTH_, SERVICE_
\
Yes
\
Use SERVICE_* configuration
|
|
/v1/SERVICE_1234/container/object
----
|
same as? project-id: 1234
\
Yes
\
roles: admin
|
in? SERVICE_operator_roles = admin
\
Yes
\
service_roles: service
|
in? SERVICE_service_roles = service
\
Yes
\
----> swift_owner = True
Other Aspects
=============
Tempurl, FormPOST, Container Sync
---------------------------------
These work on the principle that the secret key is stored in a *privileged*
header. No change is proposed as the account controls described in this
document continue to use this concept. However, an additional use-case
becomes possible: it should be possible to use temporary URLs to
allow a client to upload or download objects to or from a service
account.
Service-Specific Accounts
-------------------------
Using a common ``SERVICE_`` namespace means that all OpenStack Services share
the same account. A simple alternative is to use multiple accounts -- with
corresponding reseller_prefixes and service catalog entries. For example,
Glance could use ``IMAGE_`` and Cinder could use ``VOLUME_``. There is nothing
in this proposal that limits this option. Here is an example of a
possible configuration::
[filter:keystoneauth]
reseller_prefix = AUTH_, IMAGE_, VOLUME_
IMAGE_service_roles = glance_service
VOLUME_service_roles = cinder_service
python-swiftclient
------------------
No changes are needed in python-swiftclient to support this feature.
Service Changes To Use ``SERVICE_`` Namespace
---------------------------------------------
Services (such as Glance, Cinder) need to be enhanced as follows to use
the ``SERVICE_`` namespace:
* Change the path to use the appropriate prefix. Applications have
HTTP_X_SERVICE_CATALOG in their environment so it is easy to construct the
appropriate path.
* Add their token to the X-Service-Token header
* They should have the appropriate service role for this token
* They should include their service type (e.g., image) as a prefix to any
container names they create. This will prevent conflict between services
sharing the account.
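The changes listed above could look like the following in a service. This is a hedged sketch; the helper and its parameters are hypothetical, but the path prefix, headers, and container naming follow this spec:

```python
# Build a request against the SERVICE_ namespace: the container name is
# prefixed with the service type to avoid collisions between services.
def build_service_request(project_id, container, obj,
                          user_token, service_token,
                          reseller_prefix='SERVICE_', service_type='image'):
    path = '/v1/%s%s/%s_%s/%s' % (
        reseller_prefix, project_id, service_type, container, obj)
    headers = {
        'X-Auth-Token': user_token,        # the end-user's token
        'X-Service-Token': service_token,  # the service's own token
    }
    return path, headers
```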
Upgrade Implications
====================
The Swift software must be upgraded before Services attempt to use the
``SERVICE_`` namespace. Since Services use configurable options
to decide how they use Swift, this should be easy to sequence (i.e., upgrade
software first, then change the Service's configuration options).
How Services handle existing legacy data is beyond the scope of this
proposal.
Alternatives
============
*Account ACL*
An earlier draft proposed extending the account ACL. It also proposed to
add a default account ACL concept. On review, it was decided that this
was unnecessary for this use-case (though that work might happen in its
own right).
*Co-owner sysmeta*
An earlier draft proposed new sysmeta that established "co-ownership"
rules for containers.
*policy.xml File*:
The Keystone Composite Authorization scheme has use cases for other OpenStack
projects. The OSLO incubator policy checker module may be extended to support
roles acquired from X-Service-Token. However, this will only be used in
Swift if keystoneauth already uses a policy.xml file.
If policy files are adopted by keystoneauth, it should be easy to apply. In
effect, a different policy.xml file would be supplied for each reseller prefix.
*Proxy Logging*:
The proxy-logging middleware logs the value of X-Auth-Token. No change is
proposed.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
donagh.mccabe@hp.com
To be fully effective, changes are needed in other projects:
* Keystone Middleware. Done
* OSLO. As mentioned above, probably not needed or depended on.
* Glance. stuart.mclaren@hp.com will make the Glance changes.
* Cinder. Unknown.
* Devstack. The Swift change by itself will probably not require Devstack
changes. The Glance and Cinder services may need additional configuration
options to enable the X-Service-Token feature.
Assignee: Unknown
* Tempest. In principle, no changes should be needed as the proposal is
intended to be transparent to end-users. However, it may be possible
that some tests incorrectly access images or volume backups directly.
Assignee: Unknown
* Ceilometer (for ``SERVICE_`` namespace). It is not clear if any
changes are needed or desirable.
Work Items
----------
* swift/common/middleware/tempauth.py is modified to support multiple
reseller prefixes, the require_group option, and to process the
X-Service-Token header
* swift/common/middleware/keystoneauth.py is modified to support multiple
reseller prefixes and the service_roles option.
* Write unit tests
* Write functional tests
Repositories
------------
No new git repositories will be created.
Servers
-------
No new servers are created. The keystoneauth middleware is used by the
proxy-server.
DNS Entries
-----------
No DNS entries will need to be created or updated.
Dependencies
============
* "Service Token Composite Authorization"
https://review.openstack.org/#/c/96315
::
This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode
******************
At-Rest Encryption
******************
1. Summary
==========
To better protect the data in their clusters, Swift operators may wish
to have objects stored in an encrypted form. This spec describes a
plan to add an operator-managed encryption capability to Swift while
remaining completely transparent to clients.
Goals
-----
Swift objects are typically stored on disk as files in a standard
POSIX filesystem; in the typical 3-replica case, an object is
represented as 3 files on 3 distinct filesystems within the cluster.
An attacker may gain access to disks in a number of ways. When a disk
fails, it may be returned to the manufacturer under warranty; since it
has failed, erasing the data may not be possible, but the data may
still be present on the platters. When disks reach end-of-life, they
are discarded, and if not properly wiped, may still contain data. An
insider might steal or clone disks from the data center.
Goal 1: an attacker who gains read access to Swift's object servers'
filesystems should gain as little useful data as possible. This
provides confidentiality for users' data.
Goal 2: when a keymaster implementation allows for secure deletion of keys,
then the deletion of an object's key shall render the object irrecoverable.
This provides a means to securely delete an object.
Not Goals / Possible Future Work
--------------------------------
There are other ways to attack a Swift cluster, but this spec does not
address them. In particular, this spec does not address these threats:
* an attacker gains access to Swift's internal network
* an attacker compromises the key database
* an attacker modifies Swift's code (on the Swift nodes) for evil
If these threats are mitigated at all, it is a fortunate byproduct, but it is
not the intent of this spec to address them.
2. Encryption and Key Management
================================
There are two logical parts to at-rest encryption. The first part is
the crypto engine; this performs the actual encryption and decryption
of the data and metadata.
The second part is key management. This is the process by which the
key material is stored, retrieved and supplied to the crypto engine.
The process may be split with an agent responsible for storing key
material safely (sometimes a Hardware Security Module) and an agent
responsible for retrieving key material for the crypto engine. Swift
will support a variety of key-material retrievers, called
"keymasters", via Python's entry-points mechanism. Typically, a Swift
cluster will use only one keymaster.
2.1 Request Path
----------------
The crypto engine and the keymaster shall be implemented as three
separate pieces of middleware. The crypto engine shall have both
"decrypter" and "encrypter" filter-factory functions, and the
keymaster filter shall sit between them. Example::
[pipeline:main]
pipeline = catch_errors gatekeeper ... decrypter keymaster encrypter proxy-logging proxy-server
The encrypter middleware is responsible for encrypting the object's
data and metadata on a PUT or POST request.
The decrypter middleware is responsible for three things. First, it
decrypts the object's data and metadata on an object GET or HEAD
response. Second, it decrypts the container listing entries and the
container metadata on a container GET or HEAD response. Third, it
decrypts the account metadata on an account GET or HEAD response.
DELETE requests are unaffected by encryption, so neither
the encrypter nor the decrypter needs to do anything. The keymaster may
wish to delete any key or keys associated with the deleted entity.
OPTIONS requests should be ignored entirely by the crypto engine, as
OPTIONS requests and responses contain neither user data nor user
metadata.
2.1.1 Large Objects
-------------------
In Swift, large objects are composed of segments, which are plain old
objects, and a manifest, which is a special object that ties the
segments together. Here, "special" means "has a particular header
value".
Large-object support is implemented in middlewares ("dlo" and "slo").
The encrypter/keymaster/decrypter trio must be placed to the right of
the dlo and slo middlewares in the proxy's middleware pipeline. This
way, the encrypter and decrypter do not have to do any special
processing for large objects; rather, each request is for a plain old
object, container, or account.
2.1.2 Etag Validation
---------------------
With unencrypted objects, the object server is responsible for
validating any Etag header sent by the client on a PUT request; the
Etag header's value is the MD5 hash of the uploaded object data.
With encrypted objects, the plaintext is not available to the object server, so
the encrypter must perform the validation instead by calculating the MD5 hash
of the object data and validating this against any Etag header sent by the
client - if the two do not match then the encrypter should immediately return a
response with status 422.
Assuming that the computed MD5 hash of plaintext is validated, the encrypter
will encrypt this value and pass to the object server to be stored as system
metadata. Since the validated value will not be available until the plaintext
stream has been completely read, this metadata will be sent using a 'request
footer', as described in section 7.2.
If the client request included an Etag header then the encrypter should also
compute the MD5 hash of the ciphertext and include this value in an Etag
request footer. This will allow the object server to validate the hash of the
ciphertext that it receives, and so complete the end-to-end validation
requirement implied by the client sending an Etag: encrypter validates client
to proxy communication, object server validates proxy to object server
communication.
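The streaming validation described above might be sketched as follows. This is an illustration under stated assumptions (the function is hypothetical, not Swift's encrypter), showing how the plaintext MD5 can be computed without buffering the object:

```python
import hashlib

# Hash the plaintext as it streams through; if the client sent an Etag and
# it does not match, fail (the encrypter would return a 422 response).
def validate_etag(chunks, client_etag=None):
    md5 = hashlib.md5()
    for chunk in chunks:
        md5.update(chunk)
        yield chunk  # pass the plaintext on for encryption
    if client_etag is not None and md5.hexdigest() != client_etag:
        raise ValueError('422 Unprocessable Entity')  # etag mismatch
```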
2.2 Inter-Middleware Communication
----------------------------------
The keymaster is responsible for deciding if any particular resource should be
encrypted. This decision is implementation dependent but may be based, for
example, on container policy or account name. When a resource is not to be
encrypted the keymaster will set the key `swift.crypto.override` in the request
environ to indicate to the encrypter middleware that encryption is not
required.
When encryption is required, the keymaster communicates the encryption key to
the encrypter and decrypter middlewares by placing a zero-argument callable in
the WSGI environment dictionary at the key "swift.crypto.fetch_crypto_keys".
When called, this will return the key(s) necessary to process the current
request. It must be present on any GET or HEAD request for an account,
container, or object which contains any encrypted data or metadata. If
encrypted data or metadata is encountered while processing a GET or HEAD
request but fetch_crypto_keys is not present _or_ it does not return keys when
called, then this is an error and the client will receive a 500-series
response.
On a PUT or POST request, the keymaster must place
"swift.crypto.fetch_crypto_keys" in the WSGI environment during request
processing; that is, before passing the request to the remainder of the
middleware pipeline. This is so that the encrypter can encrypt the object's
data in a streaming fashion without buffering the whole object.
On a GET or HEAD request, the keymaster must place
"swift.crypto.fetch_crypto_keys" in the WSGI environment before returning
control to the decrypter. It need not be done at request-handling time. This
lets attributes of the key be stored in sysmeta, for example the key ID in an
external database, or anything else the keymaster wants.
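A minimal sketch of this contract, assuming a plain WSGI middleware (the class and the placeholder key derivation are hypothetical; only the environ key is from this spec):

```python
# Place a zero-argument callable in the WSGI environment *before* passing
# the request down the pipeline, so the encrypter can encrypt PUT bodies
# in a streaming fashion.
class SketchKeymaster(object):
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        keys = {'object': b'k' * 32}  # placeholder 256-bit key material

        def fetch_crypto_keys():
            return keys

        environ['swift.crypto.fetch_crypto_keys'] = fetch_crypto_keys
        return self.app(environ, start_response)
```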
3. Cipher Choice
================
3.1. The Chosen Cipher
----------------------
Swift will use AES in CTR mode with 256-bit keys.
In order to allow for ranged GET requests, the cipher shall be used
in counter (CTR) mode.
The entire object body shall be encrypted as a single byte stream. The
initialization vector (IV) used for encrypting the object body will be randomly
generated and stored in system metadata.
3.2. Why AES-256-CTR
--------------------
CTR mode basically turns a block cipher into a stream cipher, so
dealing with range GET requests becomes much easier. No modification
of the client's requested byte ranges is needed. When decrypting, some
padding will be required to align the requested data to AES's 16-byte
block size, but that can all be done at the proxy level.
Remember that when a GET request is made, the decrypter knows nothing
about the object. The object may or may not be encrypted; it may or
may not exist. If Swift were to allow configurable cipher modes, then
the requested byte range would have to be expanded to get enough bytes
for any supported cipher mode at all, which means taking into account
the block size and operating characteristics of every single supported
cipher/blocksize/mode. Besides the network overhead (especially for
small byteranges), the complexity of the resulting code would make it
an excellent home for bugs.
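The padding arithmetic described above can be illustrated as follows (this is illustration only, not Swift's implementation): to serve plaintext bytes [start, end), the proxy fetches the enclosing run of 16-byte AES blocks, starts the CTR counter at the first fetched block, and discards the leading bytes after decryption.

```python
AES_BLOCK_SIZE = 16

# Map a plaintext byte range [start, end) to the enclosing run of AES
# blocks that must be fetched and decrypted.
def aligned_range(start, end):
    aligned_start = (start // AES_BLOCK_SIZE) * AES_BLOCK_SIZE
    aligned_end = ((end + AES_BLOCK_SIZE - 1) // AES_BLOCK_SIZE) * AES_BLOCK_SIZE
    counter_offset = start // AES_BLOCK_SIZE  # counter increment for first block
    skip = start - aligned_start              # decrypted bytes to discard
    return aligned_start, aligned_end, counter_offset, skip
```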
3.3 Future-Proofing
-------------------
The cipher and mode will be stored in system metadata on every
encrypted object. This way, when Swift gains support for other ciphers
or modes, existing objects can still be decrypted.
In general we must assume that any resource (account/container/object metadata
or object data) in a Swift cluster may be encrypted using a different cipher,
or not encrypted. Consequently, the cipher choice must be stored as metadata of
every encrypted resource, along with the IV. Since user metadata may be updated
independently of objects, this implies storing encryption related metadata of
metadata.
4. Robustness
=============
4.1 No Key
----------
If the keymaster fails to add "swift.crypto.fetch_crypto_keys" to the WSGI
environment of a GET request, then the client would receive the ciphertext of
the object instead of the plaintext, which looks to the client like garbage.
However, we can tell if an object is encrypted or not by the presence of system
metadata headers, so the decrypter can prevent this by raising an error if no
key was provided for the decryption of an encrypted object.
5. Multiple Keymasters
======================
5.1 Coexisting Keymasters
-------------------------
Just as Swift supports multiple simultaneous auth systems, it can
support multiple simultaneous keymasters. With auth, each auth system
claims a subset of the Swift namespace by looking at accounts starting
with their reseller prefix. Similarly, multiple keymasters may
partition the Swift namespace in some way and thus coexist peacefully.
5.2 Keymasters in Core Swift
----------------------------
5.2.1 Trivial Keymaster
^^^^^^^^^^^^^^^^^^^^^^^
Swift will need a trivial keymaster for functional tests of the crypto
engine. The trivial keymaster will not be suitable for production use
at all. To that end, it should be deliberately kept as small as
possible without regard for any actual security of the keys.
Perhaps the trivial keymaster could use the SHA-256 of a configurable
prefix concatenated with the object's full path for the cryptographic
key. That is,::
key = SHA256(prefix_from_conf + request.path)
This will allow for testing of the PUT and GET paths, the COPY path
(the destination object's key will differ from the source object's),
and also the invalid key path (by changing the prefix after an object
is PUT).
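The derivation above is straightforward with hashlib (the prefix value used in the test is an arbitrary assumption):

```python
import hashlib

# The trivial keymaster's key derivation, exactly as the formula above:
# SHA-256 of a configured prefix concatenated with the request path.
def trivial_key(prefix, path):
    return hashlib.sha256((prefix + path).encode('utf-8')).digest()
```

Note that a COPY destination naturally gets a different key, since its path differs from the source's.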
5.2.2 Barbican Keymaster
^^^^^^^^^^^^^^^^^^^^^^^^
Swift will probably want a keymaster that stores things in Barbican at
some point.
5.3 Keymaster implementation considerations - informational only
----------------------------------------------------------------
As stated above, Swift will support a variety of keymaster implementations, and
the implementation details of any keymaster is beyond the scope of this spec
(other than providing a trivial keymaster for testing). However, we include
here an *informational* discussion of how keymasters might behave, particularly
with respect to managing the choice of when to encrypt a resource (or not).
The keymaster is ultimately responsible for specifying *whether or not* a
resource should be encrypted. The means of communicating this decision is the
request environ variable `swift.crypto.override`, as discussed above. (The only
exception to this rule may be in the case that the decrypter finds no crypto
metadata in the headers, and assumes that the object was never encrypted.)
If we consider object encryption (as opposed to account or container metadata),
a keymaster may choose to specify encryption of objects on a per-account,
per-container or per-object basis. If encryption is specified per-account or
per-container, the keymaster may base its decision on metadata that it (or some
other agent) has previously set on the account or container. For example:
* an administrator or user might add keymaster-specific system metadata to an
account when it is created;
* a keymaster may inspect container metadata for a storage policy index that
it then maps to an encrypt/don't-encrypt decision;
* a keymaster may accept a client supplied header that enables/disables
encryption and transform that to system metadata that it subsequently
inspects on each request to that resource.
If encryption is specified per-object then the decision may be based on the
object's name or based on client supplied header(s).
The keymaster is also responsible for specifying *which key* is used when a
resource is to be encrypted/decrypted. Again, if we focus on object encryption,
the keymaster could choose to use a unique key for each object, or for all
objects in the same container, or for all objects in the same account (using a
single key for an entire cluster is not disallowed but would not be
recommended). The specification of crypto metadata storage below is flexible
enough to support any of those choices.
If a keymaster chooses to specify a unique key for each object then it will
clearly need to be capable of managing as many keys as there are objects in the
cluster. For performance reasons it should also be capable of retrieving any
object's key in a timely fashion when required. A keymaster *might* choose to
store encrypted keys in Swift itself: for example, an object's unique key could
be encrypted using its container key before storing perhaps as object metadata.
However, although scalable, such a solution might not provide the desired
properties for 'secure deletion' of keys since the deletion of an object in
Swift does not guarantee immediate deletion of content on disk.
For the sake of illustration, consider a *hypothetical* keymaster
implementation code-named Vinz. Vinz enables object encryption on a
per-container basis:
* for every object PUT, Vinz inspects the target container's metadata to
discover the container's storage policy.
* Vinz then uses the storage policy as a key into its own encryption policy
configuration.
* Containers using storage-policy 'gold' or 'silver' are encrypted, containers
using storage policy 'bronze' are not encrypted.
* Significantly, the mapping of storage policy to encryption policy is a
property of the keymaster alone and could be changed if desired.
* Vinz also checks the account metadata for a metadata item
'X-Account-Sysmeta-Vinz-Encrypt: always' that a sys admin may have set. If
present Vinz will specify object encryption regardless of the container
policy.
* For objects that are to be encrypted/decrypted, Vinz adds the variable
``swift.crypto.fetch_crypto_keys=vinz_fetch_crypto_keys`` to the request
environ. Vinz also interacts with Barbican to fetch a key for the object's
container which it provides in response to calls to
``vinz_fetch_crypto_keys``.
* For objects that are not to be encrypted/decrypted, Vinz adds the variable
``swift.crypto.override=True`` to the request environ.
6 Encryption of Object Body
===========================
Each object is encrypted with the key from the keymaster. A new IV is
randomly generated by the encrypter for each object body.
The IV and the choice of cipher is stored using sysmeta. For the following
discussion we shall refer to the choice of cipher and IV collectively as
"crypto metadata".
The crypto metadata for object body can be stored as an item of sysmeta that
the encrypter adds to the object PUT request headers, e.g.::
X-Object-Sysmeta-Crypto-Meta: "{'iv': 'xxx', 'cipher': 'AES_CTR_256'}"
.. note::
Here, and in following examples, it would be possible to omit the
``'cipher'`` keyed item from the crypto metadata until a future
change introduces alternative ciphers. The existence of any crypto metadata
is sufficient to infer use of the 'AES_CTR_256' unless otherwise specified.
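One plausible serialization of this crypto metadata for a sysmeta header is sketched below; the use of JSON with a base64-encoded IV is an assumption, since the spec only fixes the logical contents (iv and cipher):

```python
import base64
import json

# Serialize crypto metadata to a header-safe string, and parse it back.
def dump_crypto_meta(iv, cipher='AES_CTR_256'):
    return json.dumps({'iv': base64.b64encode(iv).decode('ascii'),
                       'cipher': cipher})

def load_crypto_meta(value):
    meta = json.loads(value)
    meta['iv'] = base64.b64decode(meta['iv'])
    return meta
```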
7. Metadata Encryption
======================
7.1 Background
--------------
Swift entities (accounts, containers, and objects) have three kinds of
metadata.
First, there is basic object metadata, like Content-Length, Content-Type, and
Etag. These are always present and user-visible.
Second, there is user metadata. These are headers starting with
X-Object-Meta-, X-Container-Meta-, or X-Account-Meta- on objects,
containers, and accounts, respectively. There are per-entity limits on
the number, individual sizes, and aggregate size of user metadata.
User metadata is optional; if present, it is user-visible.
Third and finally, there is system metadata, often abbreviated to
"sysmeta". These are headers starting with X-Object-Sysmeta-,
X-Container-Sysmeta-, and X-Account-Sysmeta-. There are *no* limits on
the number or aggregate sizes of system metadata, though there may be
limits on individual datum sizes due to HTTP header-length
restrictions. System metadata is not user-visible or user-settable; it
is intended for use by Swift middleware to safely store data away from
the prying eyes and fingers of users.
7.2 Basic Object Metadata
-------------------------
An object's plaintext etag and content type are sensitive information and will
be stored encrypted, both in the container listing and in the object's
metadata. To accomplish this, the encrypter middleware will actually encrypt
the etag and content type *twice*: once with the object's key, and once with
the container's key.
There must be a different IV used for each different encrypted header.
Therefore, crypto metadata will be stored for the etag and content_type::
X-Object-Sysmeta-Crypto-Meta-ct: "{'iv': 'xxx', 'cipher': 'AES_CTR_256'}"
X-Object-Sysmeta-Crypto-Meta-Etag: "{'iv': 'xxx', 'cipher': 'AES_CTR_256'}"
The object-key-encrypted values will be sent to the object server using
``X-Object-Sysmeta-Crypto-Etag`` and ``X-Object-Sysmeta-Crypto-Content-Type``
headers that will be stored in the object's metadata.
The container-key-encrypted etag and content-type values will be sent to the
object server using header names ``X-Backend-Container-Update-Override-Etag``
and ``X-Backend-Container-Update-Override-Content-Type`` respectively. Existing
object server behavior is to then use these values in the ``X-Etag`` and
``X-Content-Type`` headers included with the container update sent to the
container server.
When handling a container GET request, the decrypter must process the container
listing and decrypt every occurrence of an Etag or Content-Type using the
container key. When handling an object GET or HEAD, the decrypter must decrypt
the values of ``X-Object-Sysmeta-Crypto-Etag`` and
``X-Object-Sysmeta-Crypto-Content-Type`` using the object key and copy these
values to the ``Etag`` and ``Content-Type`` headers returned to the client.
This way, the client sees the plaintext etag and content type in container
listings and in object GET or HEAD responses, just like it would without
encryption enabled, but the plaintext values of those are not stored anywhere.
.. note::
The encrypter will not know the value of the plaintext etag until it has
processed all object content. Therefore, unless the encrypter buffers the
entire object ciphertext (!) it cannot send the encrypted etag headers to
object servers before the request body. Instead, the encrypter will emit a
multipart MIME document for the request body and append the encrypted etag
as a 'request footer'. This mechanism will build on the use of
multipart MIME bodies in object server requests introduced by the Erasure
Coding feature [1].
For basic object metadata that is encrypted (i.e. etag and content-type), the
object data crypto metadata will apply, since this basic metadata is only set
by an object PUT. However, the encrypted copies of basic object metadata that
are forwarded to container servers with container updates will require
accompanying crypto metadata to also be stored in the container server DB
objects table. To avoid significant code churn in the container server, we
propose to append the crypto metadata to the basic metadata value string.
For example, the Etag header value included with a container update will have
the form::
Etag: E(CEK, <etag>); meta={'iv': 'xxx', 'cipher': 'AES_CTR_256'}
where ``E(CEK, <etag>)`` is the ciphertext of the object's etag encrypted with
the container key (``CEK``).
When handling a container GET listing, the decrypter will need to parse each
etag value in the listing returned from the container server and transform its
value to the plaintext etag expected in the response to the client. Since a
'regular' plaintext etag is a fixed length string that cannot contain the ';'
character, the decrypter will be able to easily differentiate between an
unencrypted etag value and an etag value with appended crypto metadata that by
design is always longer than a plaintext etag.
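The differentiation rule above could be sketched as follows (illustrative only; a plaintext MD5 etag is a fixed-length hex string that cannot contain ';'):

```python
# Split a container-listing etag value into its (possibly encrypted) etag
# part and any appended crypto metadata; plain etags have no ';'.
def split_listing_etag(value):
    if ';' not in value:
        return value, None  # plain, unencrypted etag
    ciphertext, _, meta = value.partition(';')
    return ciphertext.strip(), meta.strip()
```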
The crypto metadata appended to the container update etag will also be valid
for the encrypted content-type ``E(CEK, <content-type>)`` since both are set at
the same time. However, other proposed work [2] makes it possible to update the
object content-type with a POST, meaning that the crypto metadata associated
with content-type value could be different to that associated with the etag. We
therefore propose to similarly append crypto metadata in the content-type value
that is destined for the container server::
Content-Type: E(CEK, <content-type>); meta={'iv': 'yyy', 'cipher': 'AES_CTR_256'}
In this case the use of the ';' separator character will allow the decrypter to
parse content-type values in container listings and remove the crypto metadata
attribute.
7.2.1 A Note On Etag
^^^^^^^^^^^^^^^^^^^^
In the stored object's metadata, the basic-metadata field named "Etag"
will contain the MD5 hash of the ciphertext. This is required so that
the object server will not error out on an object PUT, and also so
that the object auditor will not quarantine the object due to hash
mismatch (unless bit rot has happened).
The plaintext's MD5 hash will be stored, encrypted, in system
metadata.
7.3 User Metadata
-----------------
Not only the contents of an object are sensitive; metadata is sensitive too.
Since metadata values must be valid UTF-8 strings, the encrypted values will be
suitably encoded (probably base64) for storage. Since this encoding may
increase the size of user metadata values beyond the allowed limits, the
metadata limit checking will need to be implemented by the encrypter
middleware. That way, users don't see lower metadata-size limits when
encryption is in use. The encrypter middleware will set a request environ key
`swift.constraints.override` to indicate to the proxy-server that limit
checking has already been applied.
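To illustrate why the limit check must move into the encrypter: base64 inflates the stored value by roughly a third, so the user-visible limit has to be applied to the plaintext before encoding. The ``encrypt`` callable and the limit constant below are placeholders, not Swift's actual API:

```python
import base64

MAX_META_VALUE_LENGTH = 256  # stand-in for Swift's per-value limit

def encode_metadata_value(plaintext_value, encrypt):
    """Check the user-visible limit against the plaintext, then encrypt
    and base64-encode for storage.  The stored value may legitimately
    exceed the limit, so the proxy's own check is skipped via the
    swift.constraints.override environ key (set elsewhere)."""
    if len(plaintext_value) > MAX_META_VALUE_LENGTH:
        raise ValueError('metadata value longer than %d'
                         % MAX_META_VALUE_LENGTH)
    ciphertext = encrypt(plaintext_value.encode('utf-8'))
    return base64.b64encode(ciphertext).decode('ascii')
```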
User metadata names will *not* be encrypted. Since a different IV (or indeed a
different cipher) may be used each time metadata is updated by a POST request,
encrypting metadata names would make it impossible for Swift to delete
outdated metadata items. Similarly, if encryption is enabled on an existing
Swift cluster, encrypting metadata names would prevent previously unencrypted
metadata from being deleted when updated.
For each piece of user metadata on objects we need to store crypto metadata,
since all user metadata items are encrypted with a different IV. This cannot
be stored as an item of sysmeta since sysmeta cannot be updated by an object
POST. We therefore propose to modify the object server to persist the headers
``X-Object-Massmeta-Crypto-Meta-*`` with the same semantics as ``X-Object-Meta-*``
headers i.e. ``X-Object-Massmeta-Crypto-Meta-*`` will be updated on every POST
and removed if not present in a POST. The gatekeeper middleware will prevent
``X-Object-Massmeta-Crypto-Meta-*`` headers ever being included in client
requests or responses.
The encrypter will add a ``X-Object-Massmeta-Crypto-Meta-<key>`` header
to object PUT and POST request headers for each piece of user metadata, e.g.::
X-Object-Massmeta-Crypto-Meta-<key>: "{'iv': 'zzz', 'cipher': 'AES_CTR_256'}"
.. note::
There is likely to be value in adding a generic mechanism to persist *any*
header in the ``X-Object-Massmeta-`` namespace, and adding that prefix to
those blacklisted by the gatekeeper. This would support other middlewares
(such as a keymaster) similarly annotating user metadata with middleware
generated metadata.
For user metadata on containers and accounts we need to store crypto metadata
for each item of user metadata, since these can be independently updated by
POST requests. Here we can use sysmeta to store the crypto metadata items,
e.g. for a user metadata item with key ``X-Container-Meta-Color`` we would
store::
X-Container-Sysmeta-Crypto-Meta-Color: "{'iv': 'ccc', 'cipher': 'AES_CTR_256'}"
7.4 System Metadata
-------------------
System metadata ("sysmeta") will not be encrypted.
Consider a middleware that uses sysmeta for storage. If, for some
reason, that middleware moves from before-crypto to after-crypto in
the pipeline, then all its previously stored sysmeta will become
unreadable garbage from its viewpoint.
Since middlewares sometimes do move, either due to code changes or to
correct an erroneous configuration, we prefer robustness of the
storage system here.
7.5 Summary
-----------
The encrypter will set the following headers on PUT requests to object
servers::
Etag = MD5(ciphertext) (IFF client request included an etag header)
X-Object-Sysmeta-Crypto-Meta-Etag = {'iv': <iv>, 'cipher': <C_req>}
Content-Type = E(OEK, content-type)
X-Object-Sysmeta-Crypto-Meta-ct = {'iv': <iv>, 'cipher': <C_req>}
X-Object-Sysmeta-Crypto-Meta = {'iv': <iv>, 'cipher': <C_req>}
X-Object-Sysmeta-Crypto-Etag = E(OEK, MD5(plaintext))
X-Backend-Container-Update-Override-Etag = \
E(CEK, MD5(plaintext)); meta={'iv': <iv>, 'cipher': <C_req>}
X-Backend-Container-Update-Override-Content-Type = \
E(CEK, content-type); meta={'iv': <iv>, 'cipher': <C_req>}
where ``OEK`` is the object encryption key, ``iv`` is a randomly chosen
initialization vector and ``C_req`` is the cipher used while handling this
request.
Additionally, on object PUT or POST requests that include user defined
metadata headers, the encrypter will set::
X-Object-Meta-<user_key> = E(OEK, <user_value>) for every <user_key>
X-Object-Massmeta-Crypto-Meta-<user_key> = {'iv': <iv>, 'cipher': <C_req>}
On PUT or POST requests to container servers, the encrypter will set the
following headers for each user defined metadata header::
X-Container-Meta-<user_key> = E(CEK, <user_value>)
X-Container-Sysmeta-Crypto-Meta-<user_key> = {'iv': <iv>, 'cipher': <C_req>}
Similarly, on PUT or POST requests to account servers, the encrypter will set
the following headers for each user defined metadata header::
X-Account-Meta-<user_key> = E(AEK, <user_value>)
X-Account-Sysmeta-Crypto-Meta-<user_key> = {'iv': <iv>, 'cipher': <C_req>}
where ``AEK`` is the account encryption key.
8. Client-Visible Changes
=========================
There are no known client-visible API behavior changes in this spec.
If any are found, they should be treated as flaws and fixed.
9. Possible Future Work
=======================
9.1 Protection of Internal Network
----------------------------------
Swift's security model is perimeter-based: the proxy server handles
authentication and authorization, then makes unauthenticated requests
on a private internal network to the storage servers. If an attacker
gains access to the internal network, they can read and modify any
object in the Swift cluster, as well as create new ones. It is
possible to use authenticated encryption (e.g. HMAC, GCM) to detect
object tampering.
Roughly, this would involve computing a strong hash (e.g. SHA-384
or SHA-3) of the object, then authenticating that hash. The object
auditor would have to get involved here so that we'd have an upper
bound on how long it takes to detect a modified object.
Also, to prevent an attacker from simply overwriting an encrypted
object with an unencrypted one, the crypto engine would need the
ability to notice a GET for an unencrypted object and return an error.
This implies that this feature is primarily good for clusters that
have always had encryption on, which (sadly) excludes clusters that
pre-date encryption support.
9.2 Other ciphers
-----------------
AES-256 may be considered inadequate at some point, and support for
another cipher will then be needed.
9.3 Client-Managed Keys
-----------------------
CPU-constrained clients may want to manage their own encryption keys
but have Swift perform the encryption. Amazon S3 supports something
like this. Client-managed key support would probably take the form of
a new keymaster.
9.4 Re-Keying Support
---------------------
Instead of using the object key K-obj and computing the ciphertext as
E(k-obj, plaintext), treat the object key as a key-encrypting-key
(KEK) and make up a random data-encrypting key (DEK) for each object.
Then, the object ciphertext would be E(DEK, plaintext), and in system
metadata, Swift would store E(KEK, DEK). This way, if we wish to
re-key objects, we can decrypt and re-encrypt the DEK to do it, thus
turning a re-key operation from a full read-modify-write cycle to a
simple metadata update.
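The KEK/DEK scheme could be sketched as follows; the XOR "cipher" is only a stand-in for a real cipher such as AES, used to keep the example self-contained:

```python
import os

def xor_cipher(key, data):
    # stand-in for a real cipher such as AES; XOR is NOT secure
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def put_object(kek, plaintext):
    dek = os.urandom(32)  # fresh data-encrypting key per object
    return {'body': xor_cipher(dek, plaintext),
            'wrapped_dek': xor_cipher(kek, dek)}  # E(KEK, DEK), kept in sysmeta

def get_object(kek, obj):
    dek = xor_cipher(kek, obj['wrapped_dek'])
    return xor_cipher(dek, obj['body'])

def rekey(obj, old_kek, new_kek):
    # metadata-only update: unwrap the DEK and rewrap it; the object
    # body is never read or rewritten
    dek = xor_cipher(old_kek, obj['wrapped_dek'])
    obj['wrapped_dek'] = xor_cipher(new_kek, dek)
```

Note how ``rekey`` touches only the 32-byte wrapped DEK, which is what turns re-keying into a metadata update.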
Alternatives
============
Storing user metadata in sysmeta
--------------------------------
To avoid the need to check metadata header limits in the encrypter, encrypted
metadata values could be stored using sysmeta, which is not subject to the same
limits. When handling a GET or HEAD response, the decrypter would need to
decrypt metadata values and copy them back to user metadata headers.
This alternative was rejected because object sysmeta cannot be updated by a
POST request, and so Swift would be restricted to operating in the POST-as-copy
mode when encryption is enabled.
Enforce a single immutable cipher choice per container
------------------------------------------------------
We could avoid storing cipher choice as metadata on every resource (including
individual metadata items) if the choice of cipher were made immutable for a
container or even for an account. Unfortunately it is hard to implement an
immutable property in an eventually consistent system that allows multiple
concurrent operations on distributed replicas of the same resource.
Container storage policy is 'eventually immutable' (any inconsistency is
eventually reconciled across replicas and no replica's policy state may be
updated by a client request). If we made cipher choice a property of a policy
then the cipher for a container could be similarly 'eventually immutable'.
However, it would be possible for objects in the same container to be encrypted
using different ciphers during any initial window of policy inconsistency
immediately after the container is first created. The existing container policy
reconciler process would need to re-encrypt any object found to have used the
'wrong' cipher, and to do so it would need to know which cipher had been used
for each object, which leads back to cipher choice being stored per-object.
It should also be noted that the IV would still need to be stored for every
resource, so this alternative would not mitigate the need to store crypto
metadata in general.
Furthermore, binding cipher choice to container policy does not provide a means
to guarantee an immutable cipher choice for account metadata.
Implementation
==============
Assignee(s)
-----------
Primary assignees:
| jrichli@us.ibm.com
| alistair.coles@hp.com
References
==========
[1] http://specs.openstack.org/openstack/swift-specs/specs/done/erasure_coding.html
[2] Updating containers on object fast-POST: https://review.openstack.org/#/c/102592/
::
This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode
=============================
Changing Policy of Containers
=============================
Our proposal is to give swift users power to change storage policies of
containers and objects which are contained in those containers.
Problem description
===================
Swift currently prohibits users from changing containers' storage policies so
this constraint raises at least two problems.
One problem is the flexibility. For example, there is an organization using
Swift as a backup storage of office data and all data is archived monthly in a
container named after date like 'backup-201502'. Older archive becomes less
important so users want to reduce the consumed capacity to store it. Then Swift
users will try to change the storage policy of the container into cheaper one
like '2-replica policy' or 'EC policy' but they will be strongly
disappointed to find out that they cannot change the policy of the container
once created. The workaround for this problem is creating other new container
with other storage policy then copying all objects from an existing container
to it but this workaround raises another problem.
Another problem is the reachability. Copying all files to other container
brings about the change of all files' URLs. That makes users confused and
frustrated. The workaround for this problem is that after copying all files to
new container, users delete an old container and create the same name container
again with other storage policy then copy all objects back to the original name
container. However, this obviously involves twice the workload and roughly
twice the time of a single copy.
Proposed change
===============
The ring normally differs from one policy to another so 'a/c/o' object of
policy 1 is likely to be placed in devices of different nodes from 'a/c/o'
object of policy 0. Therefore, objects replacement associated with the policy
change needs very long time and heavy internal traffic. For this reason,
an user request to change a policy must be translated
into asynchronous behavior of transferring objects among storage nodes which is
driven by background daemons. Obviously, Swift must not suspend any
user's requests to store or get information during changing policies.
We need to add or modify Swift servers' and daemons' behaviors as follows:
**Servers' changes**
1. Adding POST container API to send a request for changing a storage policy
of a container
#. Adding response headers for GET/HEAD container API to notify how many
objects are placed in a new policy or still in an old policy
#. Modifying GET/HEAD object API to get an object even if replicas are placed
in a new policy or in an old policy
**Daemons' changes**
1. Adding container-replicator a behavior to watch a container which is
requested to change its storage policy
#. Adding a new background daemon which transfers objects among storage nodes
from an old policy to a new policy
Servers' changes
----------------
1. Add New Behavior for POST Container
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Currently, Swift returns "204 No Content" for the user POST container request
with X-Storage-Policy header. This indicates "nothing done." For the purpose
of maintaining backward compatibility and avoiding accidental execution, we
prefer to leave this behavior unchanged. Therefore, we propose introducing a
new header to 'forcibly' execute the policy change, as follows.
.. list-table:: Table 1: New Request Header to change Storage Policy
:widths: 30 8 12 50
:header-rows: 1
* - Parameter
- Style
- Type
- Description
* - X-Forced-Change-Storage-Policy: <policy_name> (Optional)
- header
- xsd:string
- Change a storage policy of a container to the policy specified by
'policy_name'. This change accompanies asynchronous background process
to transfer objects.
Possible responses for this API are as follows.
.. list-table:: Table 2: Possible Response Codes for the New Request
:widths: 2 8
:header-rows: 1
* - Code
- Notes
* - 202 Accepted
- Accept the request properly and start to prepare objects replacement.
* - 400 Bad Request
- Reject the request with a policy which is deprecated or is not defined
in a configuration file.
* - 409 Conflict
- Reject the request because another changing policy process is not
completed yet (relating to 3-c change)
When a request to change policies is accepted (response code 202), the
target container stores the following two sysmetas.
.. list-table:: Table 3: Container Sysmetas for Changing Policies
:widths: 2 8
:header-rows: 1
* - Sysmeta
- Notes
* - X-Container-Sysmeta-Prev-Index: <int>
- "Pre-change" policy index. It will be used for GET or DELETE objects
which are not transferred to the new policy yet.
* - X-Container-Sysmeta-Objects-Queued: <bool>
- This will be used for determining the status of policy changing by
daemon processes. If False, policy change request is accepted but not
ready for objects transferring. If True, objects have been queued to the
special container for policy changing so those are ready for
transferring. If undefined, policy change is not requested to that
container.
This feature should be implemented as middleware 'change-policy' because of
the following two reasons:
1. This operation probably should be authorized only to a limited group
(e.g., the swift cluster's admin (reseller_admin)) because this operation
generates heavy internal traffic.
Therefore, authority of this operation should be managed in the middleware
level.
#. This operation needs to POST sysmetas to the container. Sysmeta must be
managed at the middleware level according to Swift's design principles
2. Add Response Headers for GET/HEAD Container
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Objects will be transferred gradually by backend processes. From the viewpoint
of Swift operators, it is important to know the progress of policy changing,
that is, how many objects are already transferred or still remain
untransferred. This can be accomplished by simply exposing the policy_stat table of
container DB file for each storage policy. Each policy's stat will be exposed
by ``X-Container-Storage-Policy-<Policy_name>-Bytes-Used`` and
``X-Container-Storage-Policy-<Policy_name>-Object-Count`` headers as follows::
$ curl -v -X HEAD -H "X-Auth-Token: tkn" http://<host>/v1/AUTH_test/container
< HTTP/1.1 200 OK
< X-Container-Storage-Policy-Gold-Object-Count: 3
< X-Container-Storage-Policy-Gold-Bytes-Used: 12
< X-Container-Storage-Policy-Ec42-Object-Count: 7
< X-Container-Storage-Policy-Ec42-Bytes-Used: 28
< X-Container-Object-Count: 10
< X-Container-Bytes-Used: 40
< Accept-Ranges: bytes
< X-Storage-Policy: ec42
< ...
The above response indicates that 70% of the object transfer is done.
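For illustration, the progress could be computed from those headers like this (hypothetical helper, not part of Swift; the header capitalization via ``title()`` is an assumption):

```python
def transfer_progress(headers, new_policy_name):
    """Fraction of objects already placed in the new policy, derived
    from the per-policy count headers in a HEAD container response."""
    count_header = ('X-Container-Storage-Policy-%s-Object-Count'
                    % new_policy_name.title())
    total = int(headers['X-Container-Object-Count'])
    done = int(headers.get(count_header, 0))
    return done / total if total else 1.0

# headers from the example response above
headers = {
    'X-Container-Storage-Policy-Gold-Object-Count': '3',
    'X-Container-Storage-Policy-Ec42-Object-Count': '7',
    'X-Container-Object-Count': '10',
}
```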
3. Modify Behavior of GET/HEAD object API
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In the current design, object PUT should be done only to the new policy.
This does not affect any object in the previous policy so this makes the
process of changing policies simple.
Therefore, the best way to get an object is firstly sending a GET request to
object servers according to the new policy's ring, and if the response code is
404 NOT FOUND, then a proxy resends GET requests to the previous policy's
object servers.
However, this behavior is still under discussion because sending GET/HEAD requests twice
to object servers can increase the latency of user's GET object request,
especially in the early phase of changing policies.
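The proposed fallback could be sketched as follows, where ``backend_get`` is a placeholder for the proxy's call to a given policy's object servers:

```python
def get_object_with_fallback(path, new_policy_idx, prev_policy_idx,
                             backend_get):
    """Try the new policy's object servers first; on 404, fall back to
    the previous policy (only set while the container is mid-change)."""
    status, body = backend_get(path, new_policy_idx)
    if status == 404 and prev_policy_idx is not None:
        status, body = backend_get(path, prev_policy_idx)
    return status, body
```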
Daemons' changes
----------------
1. container-replicator
^^^^^^^^^^^^^^^^^^^^^^^
To enqueue objects to the list for changing policies, some process must watch
for containers that have been requested to change their policy. Adding this
task to the container-replicator seems the best way because the
container-replicator already has the role of walking all container DBs to
sanity-check the Swift cluster. Therefore, this minimizes the extra time
container DBs must be locked to add this new feature.
Container-replicator will check if a container has
``X-Container-Sysmeta-Objects-Queued`` sysmeta and its value is False. Objects
in that container should be enqueued to the object list of a special container
for changing policies. That special container is created under the special
account ``.change_policy``. The name of a special container should be unique
and one-to-one relationship with a container to which policy changing is
requested. The name of a special container is simply defined as
``<account_name>:<container_name>``. This special account and containers are
accessed by the new daemon ``object-transferrer``, which actually transfers
objects from the old policy to the new policy.
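A sketch of the proposed naming scheme, together with the queued-object content-type value that appears in Table 4 below (hypothetical helpers, not existing Swift code):

```python
CHANGE_POLICY_ACCOUNT = '.change_policy'

def special_container(account, container):
    # one-to-one mapping from the user container to its queue container
    return CHANGE_POLICY_ACCOUNT, '%s:%s' % (account, container)

def transfer_content_type(old_index, new_index):
    # content-type value used to tag queued object rows (see Table 4)
    return 'application/x-transfer-%d-to-%d' % (old_index, new_index)
```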
2. object-transferrer
^^^^^^^^^^^^^^^^^^^^^
Object-transferrer is newly introduced daemon process for changing policies.
Object-transferrer reads lists of special containers from the account
``.change_policy`` and reads lists of objects from each special container.
Object-transferrer transfers those objects from the old policy to the new
policy by using internal client. After an object is successfully transferred
to the new policy, an object in the old policy will be deleted by DELETE
method.
When the transferrer finishes transferring all objects in a special container, it
deletes a special container and deletes sysmetas
``X-Container-Sysmeta-Prev-Index`` and ``X-Container-Sysmeta-Objects-Queued``
from a container to change that container's status from IN-CHANGING to normal
(POLICY CHANGE COMPLETED).
Example
-------
.. list-table:: Table 4: Example of data transition during changing policies
:widths: 1 4 2 4 2
:header-rows: 1
* - Step
- Description
- Container /a/c
objects
- Container /a/c/ metadata
- Container /.change_policy/a:c
objects
* - | 0
- | Init.
- | ('o1', 1)
| ('o2', 1)
| ('o3', 1)
- | X-Backend-Storage-Policy-Index: 1
- | N/A
* - | 1
- | POST /a/c X-Forced-Change-Storage-Policy: Pol-2
- | ('o1', 1)
| ('o2', 1)
| ('o3', 1)
- | X-Backend-Storage-Policy-Index: 2
| X-Container-Sysmeta-Prev-Policy-Index: 1
| X-Container-Sysmeta-Objects-Queued: False
- | N/A
* - | 2
- | container-replicator seeks policy changing containers
- | ('o1', 1)
| ('o2', 1)
| ('o3', 1)
- | X-Backend-Storage-Policy-Index: 2
| X-Container-Sysmeta-Prev-Policy-Index: 1
| X-Container-Sysmeta-Objects-Queued: True
- | ('o1', 0, 'application/x-transfer-1-to-2')
| ('o2', 0, 'application/x-transfer-1-to-2')
| ('o3', 0, 'application/x-transfer-1-to-2')
* - | 3
- | object-transferrer transfers 'o1' and 'o3'
- | ('o1', 2)
| ('o2', 1)
| ('o3', 2)
- | X-Backend-Storage-Policy-Index: 2
| X-Container-Sysmeta-Prev-Policy-Index: 1
| X-Container-Sysmeta-Objects-Queued: True
- | ('o2', 0, 'application/x-transfer-1-to-2')
* - | 4
- | object-transferrer transfers 'o2'
- | ('o1', 2)
| ('o2', 2)
| ('o3', 2)
- | X-Backend-Storage-Policy-Index: 2
| X-Container-Sysmeta-Prev-Policy-Index: 1
| X-Container-Sysmeta-Objects-Queued: True
- | Empty
* - | 5
- | object-transferrer deletes a special container and metadatas from
container /a/c
- | ('o1', 2)
| ('o2', 2)
| ('o3', 2)
- | X-Backend-Storage-Policy-Index: 2
- | N/A
The above table shows the data transition of a container whose storage policy
is being changed, along with the corresponding special container. Each tuple
shows object info: the first element is the object name, the second is a
policy index, and the third, if present, is the content-type value defined for
policy changing.
Initially, three objects are stored in the container ``/a/c`` under policy-1
(Step 0). When the request to change this container's
policy to policy-2 is accepted (Step 1), a backend policy index will be
changed to 2 and two sysmetas are stored in this container. In the periodical
container-replicator process, replicator finds a container with policy change
sysmetas and then creates a special container ``/.change_policy/a:c`` with
a list of objects (Step 2). Those object entries carry the old and new policy
info in the content-type field. When object-transferrer finds this special
container from ``.change_policy`` account, it gets some objects from the old
policy (usually from a local device) and puts them to the new policy's storage
nodes (Step 3 and 4). If the special container becomes empty (Step 5), it
indicates that policy changing for that container has finished, so the special
container is deleted and the policy-changing metadata of the original
container is also deleted.
Alternatives: As Sub-Function of Container-Reconciler
-----------------------------------------------------
Container-reconciler is a daemon process which restores objects registered in
an incorrect policy into a correct policy. Therefore, the reconciling procedure
satisfies almost all of functional requirements for policy changing. The
advantage of using container-reconciler for policy changing is that we need to
modify only a few points of the existing Swift sources. However, there is a
big problem with using the container-reconciler: it
has no function to determine the completeness of changing policy of objects
contained in a specific container. As a result, this problem makes it
complicated to handle GET/HEAD object from the previous policy and to allow
the next storage policy change request. Based on discussion in Swift hack-a-thon
(held in Feb. 2015) and Tokyo Summit (held in Oct. 2015), we decided to add
object-transferrer to change container's policy.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
Daisuke Morita (dmorita)
Milestones
----------
Target Milestone for completion:
Mitaka
Work Items
----------
* Add API for Policy Changing
* Add a middleware 'policy-change' to process Container POST request with
"X-Forced-Change-Storage-Policy" header. This middleware stores sysmeta
headers to target container DB for policy changing.
* Modify container-server to add response headers for Container GET/HEAD
request to show the progress of changing policies by exposing all the info
from policy_stat table
* Modify proxy-server (or add a feature to new middleware) to get object for
referring both new and old policy index to allow users' object read during
changing policy
* Add daemon process among storage nodes for policy changing
* Modify container-replicator to watch a container if it should be initialized
(creation of a corresponding special container) for changing policies
* Write object-transferrer code
* Daemonize object-transferrer
* Add unit, functional and probe tests to check that the new code works as
intended and that split-brain cases are handled correctly
::
This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode
===============================
Container Sharding
===============================
https://blueprints.launchpad.net/swift/+spec/container-sharding
A current limitation in swift is the container database. The SQLite database
stores the name of all the objects inside the container. As the amount of
objects in a container grows, so does the size of the database file. This causes
higher latency due to the size of, and reads against, the single file, and can be improved
by using container sharding.
Over the last year, there have been a few POCs. The last POC used distributed
prefix trees, which worked well (it kept order and allowed unlimited further sharding),
but at the last hackathon (August in Austin) it was found to require too many requests. In
smaller or lightly loaded clusters this would have been fine, but for users running clusters
under high load, this approach only added to their problems. The code for this approach can be found in
the sharding_trie branch of https://github.com/matthewoliver/swift/.
After discussions at the hackathon, it was decided we should try a similar but simpler approach,
which I am calling the Pivot Range approach. This POC is being worked on in the sharding_range branch.
https://github.com/matthewoliver/swift/tree/sharding_range
Problem Description
===================
The SQLite database used to represent a container stores the name of all the objects
contained within. As the amount of objects in a container grows, so does the size of
the database file. Because of this behaviour, the current suggestion for clusters storing
many objects in a single container is to make sure the container databases are stored on
SSDs, to reduce latency when accessing large database files.
In a previous version of this spec, I investigated different approaches we could use
to shard containers. These were:
#. Path hashing (part power)
#. Consistent Hash ring
#. Distributed prefix trees (trie)
#. Pivot/Split tree (Pivot Ranges)
In discussions about the spec at the SFO Swift Hackathon, distributed prefix trees (tries)
became the forerunner. More recently, at the Austin hackathon, it became clear that the
prefix trie approach, though it worked, would cause more requests and, on larger highly
loaded clusters, might actually cause more issues than it was solving.
It was decided to try a similar but simplified approach, which I'm calling the pivot (or split)
tree approach. This is what this version of the spec will be covering.
When talking about splitting up the objects in a container, we are only talking about the container metadata, not the objects themselves.
The Basic Idea
=================
The basic idea is rather simple. Firstly, to enable container sharding, pass an
"X-Container-Sharding: On" header via either PUT or POST::
curl -i -H 'X-Auth-Token: <token>' -H 'X-Container-Sharding: On' <url>/<account>/<container> -X PUT
Once enabled, when a container gets too full (say, at 1 million objects), a pivot point
(the middle item) is found, which will be used to split the container. This split creates 2 additional containers, each
holding half the objects. The configuration parameter `shard_container_size` determines how large a container can get before it is sharded (defaulting to 1 million).
All new containers created when splitting exist in a separate account namespace based on the user's account, meaning the user will only
ever see one container, which we call the root container. The sharded namespace is::
.sharded_<account>/
Containers that have been split no longer hold object metadata and so, once the new containers are durable, they can be deleted (except for the root container).
The root container, like any other split container, contains no objects in its ``object`` table; however, it
has a new table to store the pivot/range information. This information can be used to easily and quickly
determine where metadata should live.
The pivot (split) tree, ranges and Swift sharding
=====================================================
A determining factor in which sharding technique we chose was that having a consistent order is
important. A prefix tree is a good solution, but we need something even simpler. Conceptually we
can split a container in two on a pivot (the middle object) in the object list, turning the resulting
prefix tree into a more basic binary tree. In the initial version of this new POC, we had a class called
the PivotTree, which was a binary tree with the extra smarts we needed, but as development went on,
maintaining a full tree became more complex because we were only storing the pivot node (to save space).
Finding the bounds of what should belong in a part of the tree (for misplaced object checks, see later)
became rather complicated.
We have since decided to simplify the design again and store a list of ranges (like the
volumes of an encyclopaedia), which still behaves like a binary tree (in the searching
algorithm) but also greatly simplifies parts of the sharding in Swift.
The pivot_tree version still exists (although incomplete) in the pivot_tree branch.
Pivot Tree vs Pivot Range
----------------------------
Let's start with a picture; this is how the pivot tree worked:
.. image:: images/PivotPoints.png
Here, the small circles under the containers represent the point on which the container was pivoted,
and thus you can see the pivot tree.
The picture was one I used in the last spec, and also demonstrates how the naming of a sharded container
is defined and how they are stored in the DB.
Looking at the ``pivot_points`` table from the above image, you can see that the original container '/acc/cont' has been split a few times:
* First it pivoted at 'l', which would have created 2 new sharded containers (cont.le.l and cont.gt.l).
* Second, container /.sharded_acc/cont.le.l was split at pivoted 'f' creating cont.le.f and cont.gt.f.
* Finally the cont.gt.l container also split pivoting on 'r' creating cont.le.r and cont.gt.r.
Because it is essentially a binary tree, we can infer the existence of these six additional containers with just three pivots in the pivot table. The level of the pivot tree at which each pivot lives is also stored so we are sure to build the tree correctly whenever it's needed.
The way the tree was stored in the database was basically a list from which the tree needed
to be rebuilt. In the range approach, we just use a list of ranges. A rather simple PivotRange
class was introduced which has methods that make searching ranges, and thus the binary
search algorithm, simple.
Here is an example of the same data stored in PivotRanges:
.. image:: images/PivotRanges.png
As you can see from this diagram, there are more records in the table, but the model is simpler.
The bytes_used and object_count stored in the database may look confusing, but this is so we can keep track
of these statistics in the root container without having to visit each node. The container-sharder will update these stats as it visits containers.
This keeps the sharded containers' stats roughly correct and eventually consistent.
All user and system metadata only lives in the root container. The sharded containers only hold some metadata which helps the sharder in its work and in being able to audit the container:
* X-Container-Sysmeta-Shard-Account - This is the original account.
* X-Container-Sysmeta-Shard-Container - This is the original container.
* X-Container-Sysmeta-Shard-Lower - The lower point of the range for this container.
* X-Container-Sysmeta-Shard-Upper - The upper point of the range for this container.
Pivot point
--------------
The pivot point is the middle object in the container. As Swift is eventually consistent, all the replicas of a container
could be in flux, so they may not agree on the same pivot point to split on. Because of this, something needs to make the decision. In the initial version of the POC, this is one of the jobs of the container-sharder.
Doing so is rather simple: the sharder queries each primary copy of the container, asking what it thinks the
pivot point is. It then chooses the copy with the most objects (how it does this is explained in more detail in the container-sharder section).
There is a new method in container/backend.py called ``get_possible_pivot_point`` which does exactly what
you'd expect: it finds the pivot point of the container by querying the database with::
SELECT name
FROM object
WHERE deleted=0 LIMIT 1 OFFSET (
SELECT reported_object_count / 2
FROM container_info);
This pivot point is placed in container_info, so is now easily accessible.
PivotRange Class
-----------------
Now that we are storing a list of ranges and, as you probably remember from the initial picture, we only store the lower and upper bounds of each range, we have a class that makes dealing with ranges simple.

The class is pretty basic: it stores the timestamp, lower and upper values. ``__contains__``, ``__lt__``, ``__gt__`` and ``__eq__`` have been overridden to do checks against a string or another PivotRange.
The class also contains some extra helper methods:
* newer(other) - is it newer than another range.
* overlaps(other) - does this range overlap another range.
The PivotRange class lives in swift.common.utils, and there are some other helper methods there that are used:
* find_pivot_range(item, ranges) - Finds the range, from a list of ranges, to which an item belongs.
* pivot_to_pivot_container(account, container, lower=None, upper=None, pivot_range=None) - Given a root account and container, and either lower and upper bounds or a pivot_range, generate the required sharded container name.
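A minimal sketch of what such a class could look like (an illustration, not Swift's actual implementation; it assumes each range covers (lower, upper], matching the `le`/`gt` naming, with None meaning unbounded):

```python
class PivotRange(object):
    """Sketch of the PivotRange described above.  A range covers
    (lower, upper]: 'cont.le.l' holds names <= 'l' and 'cont.gt.l'
    holds names > 'l', so lower is exclusive, upper inclusive, and
    None means unbounded on that side."""

    def __init__(self, timestamp, lower=None, upper=None):
        self.timestamp = timestamp
        self.lower = lower
        self.upper = upper

    def __contains__(self, item):
        if self.lower is not None and item <= self.lower:
            return False
        if self.upper is not None and item > self.upper:
            return False
        return True

    def newer(self, other):
        """Is this range newer than another range?"""
        return self.timestamp > other.timestamp

    def overlaps(self, other):
        """Do the two (lower, upper] intervals intersect?"""
        if self.upper is not None and other.lower is not None \
                and self.upper <= other.lower:
            return False
        if self.lower is not None and other.upper is not None \
                and other.upper <= self.lower:
            return False
        return True


def find_pivot_range(item, ranges):
    """Return the range an item belongs to.  A linear scan for clarity;
    the real code can binary-search because the ranges are sorted."""
    for r in ranges:
        if item in r:
            return r
    return None
```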
Getting PivotRanges
--------------------
There are two ways of getting a list of PivotRanges, depending on where you are in Swift. The easiest and most obvious way is to use a new method on the ContainerBroker, `build_pivot_ranges()`.
The second is to ask the container for a list of pivot nodes rather than objects. This is done with a simple
GET to the container server, but with the nodes=pivot parameter sent::
GET /acc/cont?nodes=pivot&format=json
You can then build a list of PivotRange objects. An example of how this is done can be seen in the
`_get_pivot_ranges` method in the container sharder daemon.
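A sketch of building ranges from such a response (the JSON field names here are assumptions for illustration; see `_get_pivot_ranges` for the real code):

```python
import json


def parse_pivot_ranges(resp_body):
    """Turn a ?nodes=pivot&format=json response body into (lower, upper)
    tuples.  The record field names are assumptions for illustration
    only; the real daemon builds PivotRange objects."""
    return [(rec.get('lower'), rec.get('upper'))
            for rec in json.loads(resp_body)]


# A fabricated example body, as if returned by the container server.
body = json.dumps([{'lower': None, 'upper': 'l'},
                   {'lower': 'l', 'upper': None}])
ranges = parse_pivot_ranges(body)
```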
Effects to the object path
-------------------------------
Proxy
^^^^^^^^^
As far as the proxy is concerned nothing has changed. An object will always be hashed with the root container,
so no movement of object data is required.
Object-Server and Object-Updater
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The object-server and object-updater (async-pendings) need some more smarts because they need to update the
correct shard. In the current POC implementation, these daemons don't actually need to be shard aware;
they just need to know what to do when a container server responds with an HTTPMovedPermanently (301),
as the following picture demonstrates:
.. image:: images/seq_obj_put_delete.png
This is accomplished by getting the container-server to set the required X-Container-{Host, Device, Partition}
headers in the response, which the object-{server, updater} uses to redirect its update.
Only one new host is added to the headers; the container server determines which one by picking the
primary node of the new partition that sits at the same index as itself.
This helps stop what I call a request storm.
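The redirect handling can be sketched as follows (a hypothetical helper, not the actual object-updater code; the header names are as given above):

```python
def redirect_target(status, headers):
    """Where an object-server/updater should retry a container update.
    Sketch only: on a 301 the container server includes the shard's
    location in X-Container-{Host,Device,Partition} headers, one host
    per response (matching the responder's own node index)."""
    if status != 301:
        return None  # not a shard redirect; nothing to retry
    return (headers['X-Container-Host'],
            headers['X-Container-Device'],
            headers['X-Container-Partition'])
```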
Effects to the container path
---------------------------------
PUT/POST
^^^^^^^^^
These remain unaffected. All container metadata will be stored with the root container.
GET/HEAD
^^^^^^^^^
Both GET and HEADs get much more complicated.
HEADs need to return the bytes_used and object_count stats for the container. The root container doesn't have any objects, so we need to either:

* Visit every shard node and build the stats... this is very, very expensive; or
* Have a mechanism for updating the stats on a regular basis, accepting that they can lag a little.
The latter of these was chosen and the POC stores the stats of each shard in the root pivot_ranges table which gets updated during each sharding pass (see later).
On GETs, additional requests will be required to hit leaf nodes to gather and build the object listings.
We could make these extra requests from the proxy or the container server, both have their pros and cons:
In the proxy:
* Pro: this has the advantage of making new requests from the proxy, being able to probe and build a response.
* Con: The pivot tree of the root container needs to be passed back to the proxy. Because this tree could grow, even if only passing back the leaves, passing a tree back would have to be in a body (as the last POC using distributed prefix trees implemented) or in multiple headers.
* Con: The proxy needs to merge the responses from the container servers hit, meaning it needs to understand XML and JSON. Although this could be resolved by calling format=json and moving the container server's create_listing method somewhere more global (like utils).
In the container-server:
* Pro: No need to pass the root container's pivot nodes (or just the leaves) around, as all the action happens with access to the root container's broker.
* Pro: Happening in the container server means we can call format=json on shard containers and then use the root container's create_listing method to deal with any format cleanly.
* Con: Because it's happening on the container-servers, care needs to be taken with the requests we make. We don't want to send additional requests to every replica when visiting each leaf container; otherwise we could generate a kind of request storm.
The POC currently uses the container-server approach, keeping the proxy free of shard awareness (which is pretty cool).
DELETE
^^^^^^^
Delete has the same options as GET/HEAD above: it can run in either the proxy or the container-server. But the general idea will be:

* Receive a DELETE.
* Before applying it to the root container, apply it to the shards.
* If all succeed, then delete the root container.
Container delete is yet to be implemented in the POC.
The proxy base controller has a bunch of awesome code that deals with quorums and best responses; if
we put the DELETE code in the container server we'd need to replicate it or do some major refactoring. This isn't great, but might be the best option.
On the other hand, having shard DELETE code in the proxy suddenly makes the proxy shard aware..
which makes it less cool.. but definitely makes the delete code _much_ simpler.
**So the question is:** `Where should the shard delete code live?`
Replicator changes
--------------------
The container-replicator (and db_replicator as required) has been updated to replicate and sync the pivot_range table.
Swift is eventually consistent, meaning at some point we will have an unsharded version of a container replicating with a sharded one; some of the objects in the unsharded version might still exist and need to be merged into a shard lower down.
The current thinking is that a sharded container holds all its objects in the leaves, leaving the root and branch containers' object tables empty; these non-leaves also will not be queried when object listing. So the plan is:
#. Sync the objects from the unsharded container into the objects table of the root/branch container.
#. Let the container-sharder replicate the objects down to the correct shard (noting that dealing with misplaced objects in a shard is part of the sharder's job).
pending and merge_items
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This version of the POC takes advantage of the last POC's changes to replication. They will at least suffice while it's a POC.

The merge_items method in container/backend.py has been modified to be pivot_points aware. That is to say, the list of items
passed to it can now contain a mix of objects and pivot_nodes. A new flag has been added to the pending/pickle file format
called record_type, which defaults to RECORD_TYPE_OBJECT in existing pickle/pending files when unpickled. merge_items
sorts the items into 2 different lists based on record_type, then inserts, updates or deletes rows in the required tables accordingly.
TODO - Explain this in more detail and maybe a diagram or two.
Container replication changes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Because Swift is an eventually consistent system, we need to make sure that when container databases are replicated, we don't
only replicate items in the objects table, but also the nodes in the pivot_points table. Most of the database replication code
is part of the db_replicator, which is a parent class and so shared by account and container replication.
The current solution in the POC is to add an _other_items_hook(broker) hook that is overridden in the container replicator
to grab the items from the pivot_range table and return them in the items format to be passed into merge_items.

There is a caveat, however: currently the hook grabs all the rows from the pivot_points table.
There is no notion of a pointer/sync point. The number of pivot_points should remain fairly small, at least in relation to objects.
.. note:: We are using an other_items hook, but this can change once we get around to sharding accounts. In which case we can simply update the db_replicator to include replicating the list of ranges properly.
Container-sharder
-------------------
The container-sharder runs on all container-server nodes. At an interval it will visit all sharded containers;
on each it:
* Audits the container
* Deals with any misplaced items, that is, items that should belong in a different range container.
* Checks the size of the container, at which point _one_ of the following happens:

  * If the container is big enough and doesn't already have a pivot point defined, determine a pivot point.
  * If the container is big enough and has a pivot point defined, then split (pivot) on it.
  * If the container is small enough (see later section) then it'll shrink.
  * If the container isn't too big or too small, just leave it.

* Finally, the container's `object_count` and `bytes_used` are sent to the root container's `pivot_ranges` table.
As the above alludes to, sharding is a 2-phase process: on the first shard pass the container will get a
pivot point, and the next time round it will be sharded (split). Shrinking is even more complicated: it too
is a 2-phase process, but I didn't want to complicate this initial introduction here. See the shrinking
section below for more details.
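The per-container decision above can be sketched as follows (the function name is hypothetical, and the thresholds use the config defaults described in the shrinking section):

```python
SHARD_CONTAINER_SIZE = 1000000   # default from the spec (1 million)
SHARD_SHRINK_POINT = 50          # percent, default from the spec

def sharding_action(object_count, has_pivot_point):
    """What the sharder does with one container on this pass.
    A sketch of the decision described above, not the real daemon."""
    if object_count >= SHARD_CONTAINER_SIZE:
        if has_pivot_point:
            return 'split'       # phase 2: split on the agreed pivot
        return 'find_pivot'      # phase 1: agree on a pivot point
    if object_count < SHARD_CONTAINER_SIZE * SHARD_SHRINK_POINT // 100:
        return 'shrink'          # small enough to try shrinking
    return 'noop'                # neither too big nor too small
```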
Audit
^^^^^^
The sharder performs a basic audit which simply makes sure the current shard's range exists in the root's `pivot_ranges` table. If it's the root container, it checks to see if there are any overlapping or missing ranges.

The following truth table is from the old POC spec; we need to update it.
+----------+------------+------------------------+
| root ref | parent ref | Outcome |
+==========+============+========================+
| no | no | Quarantine container |
+----------+------------+------------------------+
| yes | no | Fix parent's reference |
+----------+------------+------------------------+
| no | yes | Fix root's reference |
+----------+------------+------------------------+
| yes | yes | Container is fine |
+----------+------------+------------------------+
Misplaced objects
^^^^^^^^^^^^^^^^^^^
A misplaced object is an object that is in the wrong shard. If it's a branch shard (a shard that has split), then anything in the object table is
misplaced and needs to be dealt with. On a leaf node, a quick SQL statement is enough to find all the objects
that are on the wrong side of the pivot. Now that we are using ranges, it's easy to determine what should and shouldn't be in each range.
The sharder uses the container-reconciler/replicator's approach of creating a container database locally in a handoff
partition, loading it up, and then using replication to push it over to where it needs to go.
Splitting (Growing)
^^^^^^^^^^^^^^^^^^^
For the sake of simplicity, the POC uses the sharder both to determine the pivot and to split. It does this in a 2-phase process which I have already alluded to above.
On each sharding pass, all sharded containers local to this container server are checked. On each check the container is audited and any misplaced items are dealt with.
Once that's complete, only *one* of the following actions happens, and then the sharder moves on to the next container or finishes its pass:
* **Phase 1** - Determine a pivot point: If there are enough objects in the container to warrant a split and a pivot point hasn't already been determined, then we need to find one. The sharder does this by:

  * First finding what the local container thinks the best pivot point is, and its object count (it can get these from broker.get_info).
  * Then querying the other primary nodes for their best pivot points and object counts.
  * Comparing the results: the container with the most objects wins; in the case of a tie, the one that reported the winning object count first.
  * Setting X-Container-Sysmeta-Shard-Pivot locally and on all nodes to the winning pivot point.
* **Phase 2** - Split the container on the pivot point: If X-Container-Sysmeta-Shard-Pivot exists then we:

  * Before we commit to splitting, ask the other primary nodes and make sure a quorum (replica / 2 + 1) of them agree on the same pivot point.
  * If we reach quorum, then it's safe to split. In that case we:

    * create the new containers locally;
    * fill them up while deleting the objects locally;
    * replicate all the containers out and update the root container with the changes (delete the old range, and add the 2 new ones).
.. note::
   When deleting objects from the container being split, the timestamp used is the same as the existing object's but using Timestamp.internal with the offset incremented, allowing newer versions of the object/metadata not to be squashed.
   Noting this in case it collides with the fast-post work acoles has been working on.. I'll ask him at summit.
* Do nothing.
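The pivot selection and quorum rules above can be sketched as follows (hypothetical helpers, not the sharder's actual code):

```python
def choose_pivot(responses):
    """Pick the winning pivot from primary-node responses, a list of
    (pivot_point, object_count) in the order received.  The most
    objects wins; on a tie, the first-reported winning count wins."""
    best = None
    for pivot, count in responses:
        if best is None or count > best[1]:
            best = (pivot, count)
    return best[0] if best else None


def has_quorum(agreeing, replica_count):
    """Quorum is replica / 2 + 1, as in the spec."""
    return agreeing >= replica_count // 2 + 1
```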
Shrinking
^^^^^^^^^^^^^
It turns out shrinking (merging containers back together when they get too small) is even more complicated than sharding (growing).
When sharding, we at least have all the objects that need to shard on the container server we are on.
When shrinking, we need to find a range neighbour that most likely lives somewhere else.

So what does the current POC do? At the moment it's another 2-phase procedure, although while writing this spec update I think it might have to become a 3-phase one, as we probably need an initial state that does nothing but
let Swift know something will happen.
So how does shrinking work? Glad you asked. Firstly, shrinking happens during the sharding pass loop.
If a container has too few items then the sharder will look into the possibility of shrinking the container,
which starts at phase 1:
* **Phase 1**:
  * Find out if the container really has few enough objects, that is, a quorum of counts below the threshold (see below).
  * Check the neighbours to see if it's possible to shrink/merge together; again this requires getting a quorum.
  * Merge, if possible, with the smallest neighbour:

    * create a new range container, locally.
    * Set some special metadata on both the smallest neighbour and on the current container:

      * `X-Container-Sysmeta-Shard-Full: <neighbour>`
      * `X-Container-Sysmeta-Shard-Empty: <this container>`

    * merge objects into the metadata "full" container (the neighbour), and update the local containers.
    * replicate, and update the root container's ranges table.
    * let misplaced objects and replication do the rest.
* **Phase 2** - On the storage node where the other neighbour (the full container) lives, when the sharder hits it:

  * Get a quorum that the metadata is still what it says (though it might be too late if it isn't).
  * Create a new container locally in a handoff partition.
  * Load it with all the data (because we want to name the container properly) while deleting locally.
  * Send range updates to the root container.
  * Delete both old containers and replicate all three containers.
The potential extra phase I see being important here would be to set the metadata only, as phase 1, to let the rest of Swift know something will be happening. The set metadata is what Swift checks for in the areas that need to be shrink aware.
.. note::
   In phase 2, maybe an actual filesystem copy would be faster and better than creating and syncing. Also, again, we have the space vs vacuum issue.
Small enough
~~~~~~~~~~~~~
OK, so that's all good and fine, but what counts as small enough, both for the container itself and for a small enough neighbour?
Shrinking has added 2 new configuration parameters to the container-sharder config section:
* `shard_shrink_point` - percentage of `shard_container_size` (default 1 million) at which a container is deemed small enough to try to shrink. Default: 50 (note: no % sign).
* `shard_shrink_merge_point` - percentage of `shard_container_size` that a container will need to be below after 2 containers have merged. Default: 75.
These are just numbers I've picked out of the air, but they are tunable. The idea is, taking the defaults:
when a container gets < 50% of shard_container_size, the sharder will look to see if there is any neighbour
whose object count, added to its own, is < 75% of shard_container_size, and merge with it. If it can't
find a neighbour that keeps it < 75% then we can't shrink, and the container will have to stay as it is.
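Taking the defaults, the merge decision can be sketched as follows (a hypothetical helper, not the actual sharder code):

```python
SHARD_CONTAINER_SIZE = 1000000   # default from the spec
SHARD_SHRINK_POINT = 50          # percent
SHARD_SHRINK_MERGE_POINT = 75    # percent

def pick_merge_neighbour(own_count, neighbour_counts):
    """Sketch of the shrink decision above: only containers below the
    shrink point are candidates, and we merge with the smallest
    neighbour provided the combined count stays below the merge point.
    Returns the index of the chosen neighbour, or None."""
    if own_count >= SHARD_CONTAINER_SIZE * SHARD_SHRINK_POINT // 100:
        return None              # not small enough to shrink
    if not neighbour_counts:
        return None              # nothing to merge with
    idx = min(range(len(neighbour_counts)),
              key=neighbour_counts.__getitem__)
    merged = own_count + neighbour_counts[idx]
    if merged < SHARD_CONTAINER_SIZE * SHARD_SHRINK_MERGE_POINT // 100:
        return idx
    return None                  # even the smallest neighbour is too big
```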
shard aware
~~~~~~~~~~~~
The new problem is that things now need to be shrink aware. Otherwise we can get ourselves into a spot of danger:

* **Container GET** - Needs to know, if it hits a shrink `empty` container, to look in the shrink `full` container for the empty container's object metadata.
* Sharding or shrinking should not touch a container that is in a shrinking state, that is, if it's either the empty or full container.
* **Sharder's misplaced objects** - A shrink full container will obviously look like it has a bunch of objects that don't belong in the range, so misplaced objects needs to know about this state, otherwise we'll have some problems.
* **Container server 301 redirects** - We want to make sure that when finding the required container to update in the 301 response, if it happens to be an empty container we redirect to the full one (or do we? maybe this doesn't matter).
* **Container shard delete** - An empty container now has 0 objects and could be deleted. When we delete a container, all the metadata is lost, including the empty and full metadata.. this could cause some interesting problems. (This hasn't been implemented yet; see the problem with deletes.)
Space cost of Sharding or to Vacuum
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When we split, currently we:
* create the 2 new containers locally in handoff partitions.
* Split objects into the new containers keeping their timestamps. At the same time delete the objects in the container being split by setting deleted = 1 and setting the timestamp to the object timestamp + offset.
* replicate all three containers.
.. note::
   Maybe a good compromise here would be, instead of splitting and filling up 2 new containers completely and
   then replicating, to create the new containers, fill them up a bit (shard_max_size), then replicate; rinse, repeat.
We set the deleted = 1 rows' timestamps to the existing object's timestamp + offset because there could be a container out there that is out of sync, with an updated object record we want to keep.
In that case the update will override the deleted row in the splitting container, and then get moved to the new shard container via the sharder's misplaced items method.
The problem we have here is that sharding a container, especially a really large one, takes up _a lot_ of room.
To shard a container, we need double the size on disk that the splitting container takes, due to the insertion
of objects into the new containers _and_ them still existing in the original with the deleted flag set.
We either live with this limitation.. or try and keep size on disk to a minimum when sharding.
Another option is to:
* create the 2 new containers locally in handoff partitions.
* Split objects into the new containers keeping their timestamps, at the same time deleting the original objects (DELETE FROM).
* Replicate all three containers.
Here in the second option, we would probably need to experiment with how often we would need to vacuum;
otherwise there is a chance that the database on disk, even though we are using `DELETE FROM`, may still remain the same size.
Further, if this old container were to sync with a replica that is out of date, _all_ objects
in the out-of-date container would be merged into the old (split) container and would all need to be rectified in merge_items.
This too could be very costly.
Sharded container stats
^^^^^^^^^^^^^^^^^^^^^^^^
As you would expect, if we simply did a HEAD of the root container, the `bytes_used` and `object_count` stats
would come back as 0. This is because, when sharded, the root container doesn't have any objects in its
objects table; they've been sharded away.

The last POC took the very slow and expensive approach of propagating the HEAD to every container shard and then collating the results. This is *very* expensive.
We discussed this in Tokyo, and the solution was to update the counts every now and again. Because we are
dealing with container shards that are also replicated, there are a lot of counts out there to update, and this gets complicated when they all need to update a single count in the root container.

Now the pivot_ranges table also stores the "current" count and bytes_used for each range; as each range represents a sharded container, we now have a place to update each individually::
CREATE TABLE pivot_ranges (
ROWID INTEGER PRIMARY KEY AUTOINCREMENT,
lower TEXT,
upper TEXT,
object_count INTEGER DEFAULT 0,
bytes_used INTEGER DEFAULT 0,
created_at TEXT,
deleted INTEGER DEFAULT 0
);
When we do a container HEAD on the root container, all we need to do is sum up the columns.
This is what the ContainerBroker's `get_pivot_usage` method does, with a simple SQL statement::
SELECT sum(object_count), sum(bytes_used)
FROM pivot_ranges
WHERE deleted=0;
Some work has been done to be able to update these `pivot_ranges` so the stats can be updated.
You can now update them through a simple PUT or DELETE via the container-server API.
The pivot range API allows you to send a PUT/DELETE request with some headers to update the pivot range; these headers are:

* x-backend-record-type - must be RECORD_TYPE_PIVOT_NODE, otherwise the request will be treated as an object.
* x-backend-pivot-objects - The object count, which can be prefixed with a - or + (more on this next).
* x-backend-pivot-bytes - The bytes used of the range; again, can be prefixed with - or +.
* x-backend-pivot-upper - The upper bound of the range; the lower bound is the name of the object in the request.
.. note::
   We use x-backend-* headers because these should only be used by Swift's backend.

The objects and bytes values can optionally be prefixed with '-' or '+'; when they are, they adjust the count accordingly.
For example, if we want to define a new value for the number of objects then we can::
x-backend-pivot-objects: 100
This will set the `object_count` stat for the range to 100. The sharder sets the new count and bytes like this during each pass to reflect the current state of the world, since at that time it knows best.
The API however allows a request of::
x-backend-pivot-objects: +1
This would increment the current value. In this case it would make the new value 101. A '-' will decrement.
The idea behind this is that if an operator wants to sacrifice more requests in the cluster for more up-to-date stats, we could get the object-updaters and object-servers to send a + or - once an object is added or deleted. The sharder would correct the count if it gets slightly out of sync.
The merge_items method in the ContainerBroker will merge prefixed requests together (+2 etc.) as required.
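A sketch of how such a header value could be applied to a stored stat (illustrative only; the real merging happens inside merge_items):

```python
def apply_stat(current, header_value):
    """Apply an x-backend-pivot-objects / -bytes value to a stored
    stat: a bare number replaces the stat, while a +/- prefix adjusts
    it, per the behaviour described above."""
    value = str(header_value)
    if value.startswith(('+', '-')):
        return current + int(value)  # relative adjustment
    return int(value)                # absolute replacement
```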
Assignee(s)
-----------
Primary assignee:
mattoliverau
Other assignees:
blmartin
Work Items
----------
Work items or tasks -- break the feature up into the things that need to be
done to implement it. Those parts might end up being done by different people,
but we're mostly trying to understand the timeline for implementation.
Repositories
------------
No new repositories required.
Services
---------
A container-sharder daemon has been created to shard containers in the background
Documentation
-------------
Will this require a documentation change? YES
If so, which documents? Deployment guide, API references, sample config files
(TBA)
Will it impact developer workflow? The limitations of sharded containers,
specifically object order, will affect DLO and existing Swift app developer
tools if pointed at a sharded container.
Security
--------
Does this introduce any additional security risks, or are there
security-related considerations which should be discussed?
TBA (I'm sure there are, like potential sharded container name collisions).
Testing
-------
What tests will be available or need to be constructed in order to
validate this? Unit/functional tests, development
environments/servers, etc.
TBA (all of the above)
Dependencies
============
TBA
@ -1,140 +0,0 @@
::
This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode
..
=================
Container aliases
=================
A container alias makes it possible to link to other containers, even to
containers in different accounts.
Problem Description
===================
Currently it is more complicated to access containers in other accounts than
containers defined in the account returned as your storage URL, because you
need to use a different storage URL than the one returned by your auth endpoint -
which is known to not be supported by all clients. Even if the storage URL of a
shared container to which you have access is known and supported by the client of
choice, shared containers are not listed when doing a GET request on the
user's account, thus they are not discoverable by regular client applications
or users.
Alias containers could simplify this task. A Swift account owner/admin with
permissions to create containers could create an alias to a container which
users of the account already have access to (most likely via ACLs), and
requests rooted at or under this alias could be redirected or proxied to a
second container in a different account.
This would make it simpler to access these containers with existing clients,
for a couple of reasons:
#. A GET request on the account level would list these containers
#. Requests to an alias container are forwarded to the target container,
making it possible to access that container without using a different
storage URL in the client.
However, setting the alias still requires the storage URL (see
`Automatic container alias provisioning`_ for alternative future work).
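As a rough illustration of the forwarding idea (purely hypothetical code using a plain dict as the alias store; the proposal keeps the target in container sysmeta inside middleware):

```python
def resolve_path(path, aliases):
    """Rewrite a request path when its account/container matches a
    stored alias.  ``aliases`` maps (account, container) to the target
    (account, container); in the real proposal this lookup would come
    from container sysmeta set at creation time."""
    parts = path.lstrip('/').split('/', 3)  # version, account, container[, obj]
    if len(parts) < 3:
        return path  # account-level request; nothing to rewrite
    key = (parts[1], parts[2])
    if key in aliases:
        parts[1], parts[2] = aliases[key]
        return '/' + '/'.join(parts)
    return path
```

For example, with an alias `/AUTH_test2/test_container` pointing at `/AUTH_test/container`, a GET on `/v1/AUTH_test2/test_container/obj` would be served from `/v1/AUTH_test/container/obj` (subject to the target's ACLs, per the Security section).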
Caveats
=======
Setting an alias should be impossible if there are objects in the source
container because these would become inaccessible, but still require storage
space. There is a possible race condition if a container is created and
objects are stored within while at the same time (plus a few milliseconds?) an
alias is set.
A reconciler mechanism (similar to the one used in storage policies) might
solve this, as would ensuring that the alias can only be set during
container creation. Un-setting an alias would be denied; instead, the alias
container is to be deleted.
Proposed Change
===============
New metadata to set and review, as well as sys-metadata to store - the target
container on a container alias.
Most of the required changes can be put into a separate middleware. There is an
existing patch: https://review.openstack.org/#/c/62494
.. note::
   The main problem identified with that patch was that a split brain could
   allow a container created on handoffs WITHOUT an alias to shadow a
   pre-existing alias container; during upload this could leave users
   confused, and the confusion potentially unresolved, about which
   location their data was written to.
It's been proposed that a reconciliation process to move objects in an alias
container to the target container could allow an eventually consistent repair
of the split-brained container.
Security
========
Test and verify what happens if requests are not yet authenticated; make sure
ACLs are respected and unauthorized requests to containers in other accounts are
impossible.
The change should include functional tests which validate that cross-account and
non-swift-owner cross-container aliases correctly respect the target's ACLs - even
if in some cases they appear to duplicate the storage-url based
cross-account/cross-container ACL tests.
If a background process is permitted to move objects stored in a container
which is later determined to have been an alias, there are likely to be
authorization implications if the ACLs on the target have changed.
Documentation
--------------
Update the documentation and document the behavior.
Work Items
----------
Further discussion of design.
Assignee(s)
-----------
Primary assignee:
cschwede <christian.schwede@enovance.com>
Future Work
===========
Automatic container alias provisioning
--------------------------------------
Cross-account container sharing might be even more simplified, leading to a
better user experience.
Let's assume there are two users in different accounts:
``test:tester`` and ``test2:tester2``
If ``test:tester`` puts an ACL onto an container ``/AUTH_test/container`` to
allow access for ``test2:tester2``, the middleware could create an alias
container ``/AUTH_test2/test_container`` linking to ``/AUTH_test/container``.
This would make it possible to discover shared containers to other
users/accounts. However, there are two challenges:
1. Name conflicts: there might be an existing container
``/AUTH_test2/test_container``
2. A lookup would require to resolve an account name into the account ID
Cross realm container aliases
-----------------------------
Might be possible to tie into container sync realms (or something similar) to
allow operators the ability to let users proxy requests to other realms.
@ -1,79 +0,0 @@
::
This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode
Scaling Expiring Objects
========================
Problem description
-------------------
The object expirer daemon does not process objects fast enough
when there is a large number of objects that need to expire.
This leads to situations like:

- Objects that give back 404s upon request, but are still showing in the
  container listing.
- Objects not being deleted in a timely manner.
- Object expirer passes never completing.
Problem Example
---------------
Imagine a client is PUTting 1000 objects a second, spread out over 10 containers, into
the cluster. First, on the PUT, we are using double the container resources of the
cluster because of the extra PUT to the .expiring_objects account. Then when we
start deleting the objects we double the strain on the container layer again: the
customer's containers now each have to handle the 100 PUTs/sec plus 100 DELETEs/sec
from the expirer daemon. If it can't keep up, the daemon begins to get behind.
If there are no changes to this system the daemon will never catch up; in addition,
other customers will begin to be starved of resources as well.
Proposed change(s)
------------------
There will need to be two changes needed to fix the problem described.
1.) Allow for the container databases to know whether an object is expired.
This will allow the container replicator to keep the object counts correct.
2.) Allow the auditor to delete objects that have expired during its pass.
This will allow for the removal of the object expirer daemon.
Implementation Plan
-------------------
There are multiple parts to the implementation. The updating of the container
database to remove the expired objects and the removal of the object from disk.
Step 1:
An 'expired' table will be added to the container database, with
'obj_row_id' and 'expired_at' columns. The 'obj_row_id' column will
correlate to the row_id for an object in the objects table. The 'expired_at'
column will be an integer timestamp of when the object expires.
The container replicator will remove object rows from the objects table when
their corresponding 'expired_at' time in the expired table is before the start
time of the pass. A trigger will delete row(s) in the 'expired' table after
the deletion of row(s) from the 'objects' table. Once the removal of the
expired objects is complete, the container database will be replicated.
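A minimal sketch of Step 1 using an in-memory SQLite database (the simplified
`objects` schema and the trigger name are illustrative; only the `expired`
table's columns come from the text above):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
    -- heavily simplified stand-in for the container db objects table
    CREATE TABLE objects (row_id INTEGER PRIMARY KEY, name TEXT);
    -- proposed table: one row per expiring object
    CREATE TABLE expired (obj_row_id INTEGER, expired_at INTEGER);
    -- proposed trigger: deleting an object row drops its expiry row
    CREATE TRIGGER expired_cleanup AFTER DELETE ON objects
    BEGIN
        DELETE FROM expired WHERE obj_row_id = OLD.row_id;
    END;
''')
conn.execute("INSERT INTO objects (row_id, name) VALUES (1, 'o1')")
conn.execute("INSERT INTO expired VALUES (1, 1400000000)")

# Replicator pass: remove object rows whose expiry precedes the pass start.
pass_start = 1500000000
conn.execute('DELETE FROM objects WHERE row_id IN '
             '(SELECT obj_row_id FROM expired WHERE expired_at < ?)',
             (pass_start,))
print(conn.execute('SELECT COUNT(*) FROM objects').fetchone()[0],
      conn.execute('SELECT COUNT(*) FROM expired').fetchone()[0])  # 0 0
```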
Step 2:
The object auditor as it makes its pass will remove any expired objects.
When the object auditor inspects an object's metadata, if the X-Delete-At is
before the current time, the auditor will delete the object. Due to slow auditor
passes, the cluster will have extra data until the objects get processed.
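The check the auditor would add can be sketched as follows (the helper name
and the metadata dict shape are illustrative; `X-Delete-At` is the header
Swift stores for expiring objects):

```python
import time

def is_expired(metadata, now=None):
    """Return True if the object's X-Delete-At time has passed.

    Sketch of the proposed auditor check: when this returns True for an
    object being audited, the auditor would delete it from disk.
    """
    if now is None:
        now = time.time()
    delete_at = metadata.get('X-Delete-At')
    if delete_at is None:
        return False  # object never expires
    return int(delete_at) <= int(now)

print(is_expired({'X-Delete-At': '100'}, now=200))          # True
print(is_expired({'X-Delete-At': '300'}, now=200))          # False
print(is_expired({'Content-Type': 'text/plain'}, now=200))  # False
```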
Rollout Plan
------------
When deploying this change the current expirer daemon can continue to run
until all objects are removed from the '.expiring_objects' account. Once that
is done the daemon can be stopped.
Also, a script will be created to populate the container databases with the
'expired_at' times for all existing objects.
Assignee(s)
-----------
Primary assignee:
(aerwin3) Alan Erwin alan.erwin@rackspace.com
@ -1,830 +0,0 @@
::
This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode
..
This template should be in ReSTructured text. Please do not delete
any of the sections in this template. If you have nothing to say
for a whole section, just write: "None". For help with syntax, see
http://sphinx-doc.org/rest.html To test out your formatting, see
http://www.tele3.cz/jbar/rest/rest.html
=======================================
Resolving limitations of fast-POST
=======================================
The purpose of this document is to describe the requirements to enable
``object_post_as_copy = false`` in the proxy config as a
reasonable deployment configuration in Swift once again, without
sacrificing the features enabled by ``object_post_as_copy = true``.
For brevity we shall use the term 'fast-POST' to refer to the mode of operation
enabled by setting ``object_post_as_copy = false``, and 'POST-as-COPY' to refer
to the mode when ``object_post_as_copy = true``.
Currently using fast-POST incurs the following limitations:
#. Users can not update the content-type of an object with a POST.
#. The change in last-modified time of an object due to a POST is not reflected
in the container listing.
#. Container-Sync does not "sync" objects after their metadata has been changed
by a POST. This is a consequence of the container listing not being updated
and the Container-Sync process therefore not detecting that the object's
state has changed.
The solution is to implement fast-POST such that a POST to an object will
trigger a container update.
This will require all of the current semantics of container updates from a PUT
(or a DELETE) to be extended into POST and similarly cover all failure
scenarios. In particular container updates from a POST must be serialize-able
(in the log transaction sense, see :ref:`container_server`) so that
out-of-order metadata updates via POST and data updates via PUT and DELETE can
be replicated and reconciled across the container databases.
Additionally the new ssync replication engine has less operational testing with
fast-POST. Some behaviors are not well understood. Currently it seems ssync
with fast-POST has the following limitations:
#. A t0.data with a t2.meta on the sender can overwrite a t1.data file on the
receiver.
#. The whole .data file is transferred to sync only a metadata update.
If possible, or as follow on work (see :ref:`ssync`), ssync should
preserve the semantic differences of syncing updates to .meta and .data
files.
Problem Description
===================
The Swift client API describes that Swift allows an object's "metadata" to be
"updated" via the POST verb.
The client API also describes that a container listing includes, for each
object, the following specific items of metadata: name, size, hash (Etag),
last_modified timestamp and content-type. If any of these metadata values are
updated then the client expectation is that the named entry in the container
listing for that object should reflect their new values.
For example if an object is uploaded at t0 with a size s0 and etag m0, and then
later at t1 an object with the same name is successfully stored with a size of
s1 and etag m1, then the container listing should *eventually* reflect the new
values s1 and m1 for the named object last_modified at t1.
These two API features can both be satisfied by either:
#. Not allowing POST to change any of the metadata values tracked in the
container.
#. Ensuring that if a POST changes one of those metadata values then the
container is also updated.
It is reasonable to argue that some of the object metadata items stored in the
container should not be allowed to be changed by a POST - the object name, size
and hash should be considered immutable for a POST because a POST is restricted
from modifying the body of the object - from which both the etag and size are
derived.
However, it can reasonably be argued that content-type should be allowed to
change on a POST. It is also reasonable to argue that the last_modified time of
an object as reported by the container listing should be equal to the timestamp
of the most recent POST or PUT.
If content-type changes are to be allowed on a POST then the container listing
must be updated, in order to satisfy the client API expectations, but the
current implementation lacks support for container updates triggered by a POST:
#. The object-server POST path does not issue container update requests, or
store async pendings.
#. The container-server's PUT /object path has no semantics for a
partial update of an object row - meaning there is no way to change the
content-type of an object without creating a new record to replace the
old one. However, because a POST only describes a transformation of an
object, and not a complete update, an object server cannot reliably provide
the entire object state required to generate a new container record under a
single timestamp.
For example, an object server handling a POST may not have the most recent
object size and/or hash, and therefore should not include those items in a
container update under the timestamp of the POST.
#. The backend container replication process similarly does not support
replication of partially updated object records.
Consequently, updates to object metadata using the fast-POST mode result in an
inconsistency between the object state and the container listing: the
Last-Modified header returned on a HEAD request for an object will reflect the
time of the last POST, while the value in the container listing will reflect
the time of the last PUT.
Furthermore, the container-sync process is unable to detect when object state
has been changed by a POST, since it relies on a new row being created in the
container database whenever an object changes.
Code archeology seems to support that the primary motivations for the
POST-as-COPY mode of operation were allowing content-type to be
modified without re-uploading the entire object with a PUT from the client,
and enabling container-sync to sync object metadata updates.
Proposed Changes
================
The changes proposed below contribute to achieving the property that all Swift
internal services which track the state of an object will eventually reach a
consistent view of the object metadata, which has three components:
#. immutable metadata (i.e. name, size and hash) that can only be set at the
time that the object data is set i.e. by a PUT request
#. content-type that is set by a PUT and *may* be modified by a POST
#. mutable metadata such as custom user metadata, which is set by a PUT or POST
Since each of these components could be set at different times on different
nodes, it follows that an object's state must include three timestamps, all or
some of which may be equal:
#. the 'data-timestamp', describing when the immutable metadata was set, which
is less than or equal to:
#. the 'content-type-timestamp', which is less than or equal to:
#. the 'metadata-timestamp' which describes when the object's mutable metadata
was set, and defines the Last-Modified time of the object.
We assert that to guarantee eventual consistency, Swift internal processes must
track the timestamp of each metadata component independently. Some or all of
the three timestamps will often be equal, but a Swift process should never
assert such equality unless it can be inferred from state generated by a client
request.
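The three-component state can be sketched as a small structure with the
ordering invariant stated above (the names are illustrative, not an existing
Swift API):

```python
from collections import namedtuple

# data-timestamp <= content-type-timestamp <= metadata-timestamp must hold
ObjectState = namedtuple('ObjectState', ['data_ts', 'ctype_ts', 'meta_ts'])

def is_consistent(state):
    """Check the ordering invariant described above."""
    return state.data_ts <= state.ctype_ts <= state.meta_ts

# A PUT at t1 followed by a content-type POST at t2:
print(is_consistent(ObjectState('t1', 't2', 't2')))  # True
# A content-type timestamp newer than the metadata-timestamp is invalid:
print(is_consistent(ObjectState('t1', 't3', 't2')))  # False
```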
Proxy-server
------------
No changes required - the proxy server already includes container update
headers with backend object POST requests.
Object-server
-------------
#. The DiskFile class will be modified to allow content-type to
be updated and written to a .meta file. When content-type is updated by a
POST, a content-type-timestamp value equal to the POST request timestamp
will also be written to the .meta file.
#. The DiskFile class will be modified so that existing content-type and
content-type-timestamp values will be copied to a new .meta file if no new
values are provided.
#. The DiskFile interface will be modified to provide methods to access the
object's data-timestamp (already stored in the .data file), content-type
timestamp (as described above) and metadata-timestamp (already stored in the
.meta file).
#. The DiskFile class will be modified to support using encoded timestamps as
.meta file names (see :ref:`rsync` and :ref:`timestamp_encoding`).
#. The object-server POST path will be updated to issue container-update
requests with fallback to the async pending queue similar to the PUT path.
#. Container update requests triggered by a POST will include all three of
the object's timestamp values: the data-timestamp, the content-type
timestamp and the metadata-timestamp. These timestamps will either be sent
as separate headers or encoded into a single timestamp header
(:ref:`timestamp_encoding`) header value.
.. _container_server:
Container-server
----------------
#. The container-server 'PUT /<object>' path will be modified to support three
timestamp values being included in the update item that are stored in the
pending file and eventually passed to the database merge_items method.
#. The merge_items method will be modified so that any existing row for an
updated object is merged with the object update to produce a new row that
encodes the most recent of each of the metadata components and their
respective timestamps i.e. the row will encode three tuples::
(data-timestamp, size, name, hash)
(content-type-timestamp, content-type)
(metadata-timestamp)
This requires storing two additional timestamps, which will be achieved
either by encoding all three timestamps in a single string
(:ref:`timestamp_encoding`) stored in the existing created_at column, or by
adding new columns to the objects table. Note that each object will continue
to have only one row in the database table.
#. The container listing code will be modified to use the object's metadata
timestamp as the value for the reported last-modified time.
.. note::
With this proposal, new container db rows do not necessarily store all of
the attributes sent with a single object update. Each new row is now
comprised of the most recent metadata components from the update and any
existing row.
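The proposed row-merging rule can be sketched as follows (rows are shown as
dicts and timestamps as sortable strings for brevity; the real merge_items
operates on database rows):

```python
def merge_row(existing, update):
    """Merge an object update into an existing row, keeping the newest of
    each metadata component independently (a sketch of the proposed rule).
    """
    if existing is None:
        return dict(update)
    merged = {}
    # immutable metadata (size, etag) travels with the data-timestamp
    src = update if update['data_ts'] > existing['data_ts'] else existing
    merged.update(data_ts=src['data_ts'], size=src['size'], etag=src['etag'])
    # content-type travels with its own timestamp
    src = update if update['ctype_ts'] > existing['ctype_ts'] else existing
    merged.update(ctype_ts=src['ctype_ts'], ctype=src['ctype'])
    # the metadata-timestamp defines the reported last-modified time
    merged['meta_ts'] = max(update['meta_ts'], existing['meta_ts'])
    return merged

# An update carrying a stale data-timestamp (t0) but a newer
# content-type (t2), merged into an existing t1 row:
row = {'data_ts': 't1', 'size': 's1', 'etag': 'm1',
       'ctype_ts': 't1', 'ctype': 'c1', 'meta_ts': 't1'}
upd = {'data_ts': 't0', 'size': 's0', 'etag': 'm0',
       'ctype_ts': 't2', 'ctype': 'c2', 'meta_ts': 't2'}
print(merge_row(row, upd))
# keeps t1's immutable metadata, takes the t2 content-type
```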
Container-replicator
--------------------
#. The container-replicator will be modified to ensure that all three object
timestamps are included in replication updates. At the receiving end these
are handled by the same merge_items method as described above.
.. _rsync:
rsync object replication
------------------------
With the proposed changes, .meta files may now contain a content-type value set
at a different time to the other mutable metadata. Unlike :ref:`ssync`, the
rsync based replication process has no visibility of the contents of the object
files. The replication process cannot therefore distinguish between two meta
files which have the same name but may contain different content-type and
content-type-timestamp values.
The naming of .meta files must therefore be modified so that the filename
indicates both the metadata-timestamp and the content-type-timestamp. The
current proposal is to use an encoding of the content-type-timestamp and
metadata-timestamp as the .meta file name. Specifically:
* if the .meta file contains a content-type value, its name shall be
the encoding of the metadata-timestamp followed by the (older or equal)
content-type-timestamp, with a `.meta` extension.
* if the .meta file does not contain a content-type value, its name shall
be the metadata-timestamp, with a `.meta` extension.
Other options for .meta file naming are discussed in :ref:`alternatives`.
The hash_cleanup_listdir function will be modified so that the decision as to
whether a particular meta file should be deleted will no longer be based on a
lexicographical sort of the file names - the file names will be decomposed into
a content-type-timestamp and a metadata-timestamp and the one (or two) file(s)
having the newest of each will be retained.
In addition the DiskFile implementation must be changed to preserve, and read,
up to two meta files in the object directory when their names indicate that one
contains the most recent content-type and the other contains the most recent
metadata.
Multiple .meta files will only exist until the next PUT or POST request is
handled. On a PUT, all older .meta files are deleted - their content is
obsolete. On a newer POST, the multiple .meta files are read and their contents
merged, taking the newest of user metadata and content-type. The merged
metadata is written to a single newer .meta file and all older .meta files are
deleted.
For example, consider an object directory that after rsync has the following
files (sorted)::
t0_offset.meta - unwanted
t2.data - wanted, most recent data-timestamp
enc(t6, t2).meta - wanted, most recent metadata-timestamp
enc(t4, t3).meta - unwanted
enc(t5, t5).meta - wanted, most recent content-type-timestamp
If a POST occurs at t7 with new user metadata but no new content-type value,
the contents of the directory after handling the post will be::
t2.data
enc(t7, t5).meta
Note that when an object server merges content-type and metadata-timestamp from
two .meta files, it is reconstructing the same state that will already have
been propagated to container servers. There is no need for object servers to
send container updates in response to replication events (i.e. no change to
current behavior in that respect).
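The retention decision described above can be sketched as follows, assuming
the .meta file names have already been decoded into their timestamps (the
decoding itself is elided; the tuples and function name are illustrative):

```python
def meta_files_to_keep(decoded):
    """Pick the .meta files hash_cleanup_listdir must retain.

    ``decoded`` is a non-empty list of (filename, metadata_ts,
    content_type_ts) tuples, where content_type_ts is None for .meta files
    that carry no content-type. One or two files are kept: the newest by
    metadata-timestamp and the newest by content-type-timestamp.
    """
    keep = {max(decoded, key=lambda f: f[1])[0]}
    with_ctype = [f for f in decoded if f[2] is not None]
    if with_ctype:
        keep.add(max(with_ctype, key=lambda f: f[2])[0])
    return keep

# The example directory above, with enc(meta_ts, ctype_ts) names decoded:
files = [('t0_offset.meta', 't0', None),
         ('enc(t6, t2).meta', 't6', 't2'),
         ('enc(t4, t3).meta', 't4', 't3'),
         ('enc(t5, t5).meta', 't5', 't5')]
print(sorted(meta_files_to_keep(files)))
# ['enc(t5, t5).meta', 'enc(t6, t2).meta']
```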
.. _ssync:
Updates to ssync
----------------
Additionally we should endeavor to enumerate the required changes to ssync to
support the preservation of semantic differences between a POST and a PUT. For
example:
#. The missing check request sent by the ssync_sender should include enough
information for the ssync_receiver to determine which of the object's state
is out of date i.e. none, some or all of data, content-type and metadata.
#. The missing check response from ssync_receiver should include enough
information for the ssync_sender to differentiate between a hash that
is "missing" and out-of-date content-type and/or metadata update.
#. When handling ssync_sender's send_list during the UPDATES portion, in
addition to sending PUT and DELETE requests the sender should be able
to send a pure metadata POST update
#. The ssync_receiver's updates method must be prepared to dispatch POST
requests to the underlying object-server app in addition to PUT and
DELETE requests.
The current ssync implementation seems to indicate that it was originally
intended to be optimized for the default POST-as-COPY configuration, and it
does not handle some corner cases with fast-POST as well as rsync replication.
Because ssync is still described as experimental, improving ssync support
should not be a requirement for resolving the current limitations of fast-POST
for rsync deployments. However ssync is still actively being developed and
improved, and remains a key component to a number of other efforts improve and
enhance Swift. Full ssync support for fast-POST should be a requirement for
making fast-POST the default.
.. _container-sync:
Container Sync
--------------
Container Sync will require both the ability to discover that an
object has changed, and the ability to request that object.
Because each object update via fast-POST will trigger a container
update, there will be a new row (and timestamp) in the container
databases for every update to an object (just like with POST-as-COPY
today!)
The metadata-timestamp in the database will reflect a complete version
of an object and metadata transformation. The exact version of the
object retrieved can be verified with X-Backend-Timestamp.
.. _x-newest:
X-Newest
--------
X-Newest should be updated to use X-Backend-Timestamp.
.. note::
We should fix the sync daemon from using the row[created_at] value
to set the x-timestamp of the object PUT to the peer container, and
have it instead use the X-Timestamp from the object being synced.
.. _timestamp_encoding:
Multiple Timestamp Encoding
---------------------------
If required, multiple timestamps t0, t1 ... will be encoded into a single
timestamp string having the form::
<t0[_offset]>[<+/-><offset_to_t1>[<+/-><offset_to_t2>]]
where:
* t0 may include an offset, if non-zero, with leading zeros removed from the
offset, e.g. 1234567890.12345_2
* offset_to_t1 is the difference in units of 10 microseconds between t0 and
t1, in hex, if non-zero
* offset_to_t2 is the difference in units of 10 microseconds between t1 and
t2, in hex, if non-zero
An example of encoding three monotonically increasing timestamps would be::
1234567890.12345_2+9f3c+aa322
An example of encoding of three equal timestamps would be::
1234567890.12345_2
i.e. identical to the shortened form of t0.
An example of encoding two timestamps where the second is older would be::
1234567890.12345_2-9f3c
Note that a lexicographical sort of encoded timestamps is not guaranteed to
produce any chronological ordering.
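A sketch of the encoding (handling of t0's optional `_offset` part is elided,
and the function name is illustrative rather than an established API):

```python
def encode_timestamps(t0, t1=None, t2=None):
    """Encode up to three float timestamps into one string: each offset is
    expressed in units of 10 microseconds, in hex, and omitted when zero.
    """
    parts = ['%.5f' % t0]
    prev = t0
    for t in (t1, t2):
        if t is None:
            break
        delta = int(round((t - prev) * 100000))  # 10-microsecond units
        if delta:
            parts.append('%s%x' % ('+' if delta > 0 else '-', abs(delta)))
        prev = t
    return ''.join(parts)

t0 = 1234567890.12345
print(encode_timestamps(t0))                # 1234567890.12345
print(encode_timestamps(t0, t0 + 0.40764))  # 1234567890.12345+9f3c
print(encode_timestamps(t0, t0 - 0.40764))  # 1234567890.12345-9f3c
```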
Example Scenarios
=================
In the following examples we attempt to enumerate various failure conditions
that would require making decisions about how the implementation serializes or
merges out-of-order metadata updates.
These examples use the current proposal for encoding multiple timestamps
:ref:`timestamp_encoding` in .meta file names and in the container db
`created_at` column. For simplicity we use the shorthand `t2-t1` to represent
the encoding of timestamps t2 and t1 in this form, but note that the `-t1` part
is in fact a time difference and not the absolute value of the t2 timestamp.
(The exact format of the .meta file name is still being discussed.)
Consider initial state for an object that was PUT at time t1::
Obj server 1,2,3: /t1.data {etag=m1, size=s1, c_type=c1}
Cont server 1,2,3: {ts=t1, etag=m1, size=s1, c_type=c1}
Happy Path
----------
All servers initially consistent, successful fast-POST at time t2 that
modifies an object's content-type. When all is well our object
servers will end up in a consistent state::
Obj server 1,2,3: /t1.data {etag=m1, size=s1, c_type=c1}
/t2+t2.meta {c_type=c2}
The proposal is for the fast-POST to trigger a container update that is
a combination of the existing metadata from the .data file and the new
content-type::
Cont server 1,2,3: {ts=t1+t2+t2, etag=m1, size=s1, c_type=c2}
.. note::
A container update will be issued for every POST even if the
content-type is not updated to ensure that the container listing
last-modified time is consistent with the object state, and to ensure
that a new row is created for container sync.
Now consider some failure scenarios...
Object node down
----------------
In this case only a subset of object nodes would receive the metadata
update::
Obj server 1,2: /t1.data {etag=m1, size=s1, c_type=c1}
/t2+t2.meta {c_type=c2}
Obj server 3: /t1.data {etag=m1, size=s1, c_type=c1}
Normal object replication will copy the metadata update t2 to the failed object
server 3, bringing its state in line with the other object servers.
Because the failed object node would not have updated its respective
container server, that will be out of date as well::
Cont server 1,2: {ts=t1+t2+t2, etag=m1, size=s1, c_type=c2}
Cont server 3: {ts=t1, etag=m1, size=s1, c_type=c1}
During replication, row merging on server 3 would merge the content-type update
at t2 with the existing row to create a new row identical to that on servers 1
and 2.
Container update fails
----------------------
If a container server is offline while an object server is handling a POST then
the object server will store an async_pending of the update record in the same
way as for PUTs and DELETEs.
Object node missing .data file
------------------------------
POST will return 404 and not process the request if the object does not
exist::
Obj server 1,2: /t1.data {etag=m1, size=s1, c_type=c1}
/t2+t2.meta {c_type=c2}
Obj server 3: 404
After object replication the object servers should have the same files. This
requires no change to rsync replication. ssync replication will be modified to
send a PUT with t1 (including content-type=c1) followed by a POST with t2
(including content-type=c2), i.e. ssync will replicate the requests received by
the healthy servers.
Object node stale .data file
----------------------------
If one object server has an older .data file then the composite timestamp sent
with its container update will not match that of the other nodes::
Obj server 1,2: /t1.data {etag=m1, size=s1, c_type=c1}
/t2+t2.meta {c_type=c2}
Obj server 3: /t0.data {etag=m0, size=s0, c_type=c0}
/t2+t2.meta {c_type=c2}
After object replication the object servers should have the same files. This
requires no change to rsync replication. ssync replication will be modified to
send a PUT with t1, i.e. ssync will replicate the request missed by the failed
server.
Assuming container server 3 was also out of date, the container row will be
updated to::
Cont server 1,2: {ts=t1+t2+t2, etag=m1, size=s1, c_type=c2}
Cont server 3: {ts=t0+t2+t2, etag=m0, size=s0, c_type=c2}
During container replication on server 3, row merging will apply the later data
timestamp at t1 to the existing row to create a new row that matches servers 1
and 2.
Assuming container server 3 was also up to date, the container row will be
updated to::
Cont server 1,2: {ts=t1+t2+t2, etag=m1, size=s1, c_type=c2}
Cont server 3: {ts=t1+t2+t2, etag=m1, size=s1, c_type=c2}
Note that in this case the row merging has applied the content-type from the
update but ignored the immutable metadata from the update which is older than
the values in the existing db row.
Newest .data file node down
---------------------------
If none of the nodes that have the t1 .data file are available to handle the
POST at the time of the client request, the metadata may only be applied on
nodes having a stale .data file::
Obj server 1,2: /t0.data {etag=m0, size=s0, c_type=c0}
/t2+t2.meta {c_type=c2}
Obj server 3: /t1.data {etag=m1, size=s1, c_type=c1}
Object replication will eventually make the object servers consistent.
The containers may be similarly inconsistent::
Cont server 1,2: {ts=t0+t2+t2, etag=m0, size=s0, c_type=c2}
Cont server 3: {ts=t1, etag=m1, size=s1, c_type=c1}
During container replication on server 3, row merging will apply the
content-type update at t2 to the existing row but ignore the data-timestamp and
immutable metadata, since the existing row on server 3 has newer data
timestamp.
During replication on container servers 1 and 2, row merging will apply the
data-timestamp and immutable metadata updates from server 3 but ignore the
content-type update since they have a newer content-type-timestamp.
Additional POSTs with Content-Type to overwrite metadata
--------------------------------------------------------
If the initial state already includes a metadata update, the content-type may
have been overridden::
Obj server 1,2,3: /t1.data {etag=m1, size=s1, c_type=c1}
/t2+t2.meta {c_type=c2}
In this case the containers would also reflect the content-type of the
metadata update::
Cont server 1,2,3: {ts=t1+t2+t2, etag=m1, size=s1, c_type=c2}
When another POST occurs at t3 which includes a content-type update, the final
state of the object server would overwrite the last metadata update entirely::
Obj server 1,2,3: /t1.data {etag=m1, size=s1, c_type=c1}
/t3+t3.meta {c_type=c3}
Additional POSTs without Content-Type to overwrite metadata
-----------------------------------------------------------
If the initial state already includes a metadata update, the content-type may
have been overridden::
Obj server 1,2,3: /t1.data {etag=m1, size=s1, c_type=c1}
/t2+t2.meta {c_type=c2}
In this case the containers would also reflect the content-type of the
metadata update::
Cont server 1,2,3: {ts=t1+t2+t2, etag=m1, size=s1, c_type=c2}
When another POST occurs at t3 which does not include a content-type update,
the object server will merge its current record of the content-type with the
new metadata and store in a new .meta file, the name of which indicates that it
contains state modified at two separate times::
Obj server 1,2,3: /t1.data {etag=m1, size=s1, c_type=c1}
/t3-t2.meta {c_type=c2}
The container server updates will now encode three timestamps which will cause
row merging on the container servers to apply the metadata-timestamp to their
existing rows and create a new row for the object::
Cont server 1,2,3: {ts=t1+t2+t3, etag=m1, size=s1, c_type=c2}
Resolving conflicts with multiple metadata overwrites
-----------------------------------------------------
If a previous content-type update is not consistent across all nodes then a
subsequent metadata update at t3 that does not include a content-type value
will result in divergent metadata sets across the nodes::
Obj server 1,2: /t1.data {etag=m1, size=s1, c_type=c1}
/t3-t2.meta {c_type=c2}
Obj server 3: /t1.data {etag=m1, size=s1, c_type=c1}
/t3.meta
Even worse, if subsequent POSTs are not handled successfully on
all nodes then we can end up with no single node having completely up to date
metadata::
Obj server 1,2: /t1.data {etag=m1, size=s1, c_type=c1}
/t3-t2.meta {c_type=c2}
Obj server 3: /t1.data {etag=m1, size=s1, c_type=c1}
/t4.meta
With rsync replication, each object server will eventually have a consistent
set of files, but will have two .meta files::
Obj server 1,2,3: /t1.data {etag=m1, size=s1, c_type=c1}
/t3-t2.meta {c_type=c2}
/t4.meta
When the diskfile is opened, both .meta files are read to retrieve the most
recent content-type and the most recent mutable metadata.
With ssync replication, the inconsistent nodes will exchange POSTs that will
eventually result in a consistent single .meta file on each node::
Obj server 1,2,3: /t1.data {etag=m1, size=s1, c_type=c1}
/t4-t2.meta {c_type=c2}
.. _alternatives:
Alternatives
============
Alternative .meta file naming
-----------------------------
#. Encoding the content-type timestamp followed by the metadata timestamp (i.e.
reversing the order w.r.t. the proposal). This would result in encodings that
always have a positive offset which is consistent with the
enc(data-timestamp, content-type-timestamp, metadata-timestamp) form used
in container updates. However, having the proposed encoding order ensures
that files having *some* content newer than a data file will always sort
ahead of the data file, which reduces the churn in diskfile code such as
hash_cleanup_listdir, and is arguably more intuitive for human inspection
("t2-offset.meta is preserved in the dir with t1.data because t2 is later
than t1", rather than "t0+offset.meta is preserved in the dir with t1.data
because the sum of t0 and offset is later than t1").
#. Using a two vector timestamp with the 'normal' part being the content-type
timestamp and the offset being the time delta to the metadata-timestamp.
(It is the author's understanding that it is safe to use a timestamp offset
to represent the metadata-timestamp in this way because .meta files will
never be assigned a timestamp offset by the container-reconciler, since the
container-reconciler only uses timestamp offsets to impose an internal
ordering on object PUTs and DELETEs having the same external timestamp.)
This is in principle the same as the proposed option but possibly results
in a less compact filename and may create confusion with two vector
timestamps.
#. Using a combination of the metadata-timestamp and a hash of the .meta file
contents to form a name for the .meta file. The timestamp part allows for
cleanup of .meta files that are older than a .data or .ts file, while the
hash part distinguishes .meta that contain different Content-Type and/or
Content-Type timestamp values. During replication, all valid .meta files are
preserved in the object directory (the worst case number being capped at the
number of replicas in the object ring). When DiskFile loads the metadata,
all .meta files will be read and the most recent values merged into the
metadata dict. When the merged metadata dict is written, all contributing
.meta files may be deleted.
This option is more general in that it allows other metadata items to also
have individual timestamps (without requiring an unbounded number of
timestamps to be encoded in the .meta filename). It therefore supports
other potential new features such as updatable object sysmeta and
updatable user metadata. Any such feature is of course beyond the scope of
proposal.
Just use POST-as-COPY
---------------------
POST-as-COPY has some limitations that make it ill-suited for some workloads.
#. POST to large objects is slow
#. POST during failure can result in stale data being copied over fresher data.
Also, because COPY is exposed to the client directly, the semantic behavior can
always be achieved explicitly by a determined client.
Force content-type-timestamp to be same as metadata-timestamp
-------------------------------------------------------------
We can simplify the management of .meta files by requiring every POST arriving
at an object server to include the content-type, and therefore remove the need
to maintain a separate content-type-timestamp. There would be no need to
maintain multiple meta files. Container updates would still need to be sent
during an object POST in order to keep the container server in sync with the
object state. The container server still needs to be modified to merge both
content-type and metadata-timestamps with an existing row.
The requirement for content-type to be included with every POST is unreasonably
onerous on clients, but could be achieved by having the proxy server retrieve
the current content-type using a HEAD request with X-Newest = True and insert
it into the backend POST when content-type is missing from the client POST.
However, this scheme violates our assertion that no internal process should
ever assume one of an object's timestamps to be equal to another. In this case,
the proxy is forcing the content-type-timestamp to be the same as the metadata
timestamp that is due to the incoming POST request. In failure conditions, the
proxy may read a stale content-type value, associate it with the latest
metadata-timestamp and as a result erroneously overwrite a fresher content-type
value.
If, as a development of this alternative, the proxy were also to read the
'current' content-type value and its timestamp using a HEAD with X-Newest, and
add both of these items to the backend object POST, then we are back to the
object server needing to maintain separate content-type and
metadata-timestamps in the .meta file.
Further, if the newest content-type in the system is unavailable during a POST
it would be lost; worse yet, if the latest value was associated with a .data
file there is no obvious way to correctly promote its data timestamp values in
the containers short of doing the very merging described in this spec - so
this alternative comes out as less desirable for the same amount of work.
Use the metadata-timestamp as last modified
-------------------------------------------
This is basically what both fast-POST and POST-as-COPY do today. When an
object's metadata is updated at t3 the x-timestamp for the transformed object
is t3. However, fast-POST never updates the last modified in the container
listing.
In the case of fast-POST it can apply a t3 metadata update asynchronously to a
t1 .data file because it restricts metadata updates from including changes to
metadata that would require being merged into a container update.
We want to be able to update the content-type and therefore the container
listing.
In the case of POST-as-COPY it can do this because the metadata update applied
to a .data file t0 is considered "newer" than the .data file t1. The record for
the transformation applied to the t0 data file at t3 is stored in the
container, and the record of the "newer" t1 .data file is irrelevant.
Use metadata-timestamp as primary portion of two vector timestamp
-----------------------------------------------------------------
This suggests the .data file timestamp would be the offset, and merging t3_t0
and t3_t1 would prefer t3_t1. However merging t3_t0 and t1 would prefer t3_t0
(much as POST-as-COPY does today). The unlink-old method would have to be
updated for rsync replication to ensure that a t3_t0 metadata file "guards"
the t0 .data file against the "newer" t1 .data file.
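The comparison implied by this alternative can be sketched by modeling each
on-disk state as a (metadata-timestamp, data-timestamp) pair, so that ordinary
tuple ordering treats the metadata-timestamp as the primary vector. This is an
illustrative model only, not Swift's actual Timestamp class:

```python
# Each on-disk state is modeled as (metadata_ts, data_ts); a bare .data
# file at t1 is (t1, t1). Tuple comparison makes the metadata timestamp
# the primary vector and the data timestamp the offset.

def newest(*states):
    return max(states)

t3_t0 = (3, 0)   # t3 metadata applied to t0 data
t3_t1 = (3, 1)   # t3 metadata applied to t1 data
t1 = (1, 1)      # plain t1 .data file

assert newest(t3_t0, t3_t1) == t3_t1  # same primary: newer data wins
assert newest(t3_t0, t1) == t3_t0     # t3_t0 "guards" the older t0 data
```

The second assertion shows the behavior described above: the t3_t0 metadata
file outranks the t1 .data file, which is why replication must be taught not
to unlink the t0 data it guards.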
It is generally presumed that a stale read during POST-as-COPY resulting in
data loss is rare; the same false hope applies equally to this proposed
specification for a container-updating fast-POST implementation. The
difference is that this implementation would throw out the *meta* data update
in preference to the latest .data file instead.
This alternative was rejected as workable but less desirable.
Implementation
==============
#. `Prefer X-Backend-Timestamp for X-Newest <https://review.openstack.org/133869>`_
#. `Update container on fast-POST <https://review.openstack.org/#/c/135380/>`_
#. `Make ssync compatible with fast-post meta files <https://review.openstack.org/#/c/138498/>`_
Assignee(s)
-----------
#. Alistair Coles (acoles)
#. Clay Gerrard (clayg)
Work Items
----------
TBD
Repositories
------------
None
Servers
-------
None
DNS Entries
-----------
None
Documentation
-------------
Changes may be required to API docs if the last modified time reported in a
container listing changes to be the time of a POST rather than the time of the
PUT (there is currently an inconsistency between POST-as-COPY operation and
fast-POST operation).
We may want to deprecate POST-as-COPY after successful implementation of this
proposal.
Security
--------
None
Testing
-------
New and modified unit tests will be required for the object server and
container-sync. Probe tests will be useful to verify behavior.
Dependencies
============
None

::
This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode
..
================================================
formpost should allow subprefix-based signatures
================================================
The signature used by formpost to validate a file upload should also be
considered valid if the object_prefix used to calculate the signature is a
true subprefix of the object_prefix used in the action URL of the form.
With this, sharing data with external people via web-based applications
becomes much easier, because just one signature is needed to create forms for
every pseudofolder in a container.
Problem Description
===================
At the moment, if one wants to use a form to upload data, the signature of the
form must be calculated using the same object_prefix as the object_prefix in
the URL of the action attribute of the form.
We propose to allow dynamically created forms, which are valid for all
object_prefixes that share a common prefix.
With this, one could generate a single signature that is valid for all
pseudofolders in a container. This signature could be used in a web
application to share every possible pseudofolder of a container with external
people. The user who wants to share their container would not be obliged to
generate a signature for every pseudofolder.
Proposed Change
===============
The formpost middleware should be changed. The code change would be really small.
If a subprefix-based signature is desired, the hmac_body of the signature must contain a "subprefix"
field to make sure that the creator of the signature explicitly allows uploading of objects into
sub-pseudofolders. Beyond that, the form must contain a hidden field "subprefix", too.
Formpost would use the value of this field to calculate a hash based on that
value. Furthermore, the middleware would check if the object path really contains this prefix.
Let's look at an example: a user wants to share the pseudofolder "folder" with
external users in a web-based fashion. He (or a web application) calculates
the signature with the path "/v1/my_account/container/folder" and subprefix
"folder":
::
import hmac
from hashlib import sha1
from time import time
path = '/v1/my_account/container/folder'
redirect = 'https://myserver.com/some-page'
max_file_size = 104857600
max_file_count = 10
expires = int(time() + 600)
key = 'MYKEY'
hmac_body = '%s\n%s\n%s\n%s\n%s\n%s' % (path, redirect,
max_file_size, max_file_count, expires, "folder")
signature = hmac.new(key.encode(), hmac_body.encode(), sha1).hexdigest()
If an external user wants to POST to the subfolder folder/subfolder/, a form
containing the signature calculated above and the hidden field subprefix would
be used:
::
<![CDATA[
<form action="https://myswift/v1/my_account/container/folder/subfolder/"
method="POST"
enctype="multipart/form-data">
<input type="hidden" name="redirect" value="REDIRECT_URL"/>
<input type="hidden" name="max_file_size" value="BYTES"/>
<input type="hidden" name="max_file_count" value="COUNT"/>
<input type="hidden" name="expires" value="UNIX_TIMESTAMP"/>
<input type="hidden" name="signature" value="HMAC"/>
<input type="hidden" name="subprefix" value="folder"/>
<input type="file" name="FILE_NAME"/>
<br/>
<input type="submit"/>
</form>
]]>
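A hedged sketch of the check the modified middleware might perform follows.
The helper name, the account/container values, and the exact signing layout
are assumptions based on the example above, not formpost's actual code:

```python
import hmac
from hashlib import sha1

def subprefix_signature_valid(key, object_prefix, subprefix, signature,
                              redirect, max_file_size, max_file_count,
                              expires):
    """object_prefix is the container-relative path taken from the form's
    action URL, e.g. 'folder/subfolder/'."""
    # the upload path must really lie under the signed subprefix
    if not object_prefix.startswith(subprefix):
        return False
    # recompute the signature over the subprefix-based path
    path = '/v1/my_account/container/%s' % subprefix
    hmac_body = '%s\n%s\n%s\n%s\n%s\n%s' % (
        path, redirect, max_file_size, max_file_count,
        expires, subprefix)
    expected = hmac.new(key.encode(), hmac_body.encode(),
                        sha1).hexdigest()
    return hmac.compare_digest(expected, signature)
```

Note the two separate checks: the HMAC proves the subprefix was authorized by
the key holder, and the startswith test proves the actual upload path lies
under that subprefix.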
Implementation
==============
Assignee(s)
-----------
Primary assignee:
bartz
Work Items
----------
Add modifications to formpost and respective test module.
Repositories
------------
None
Servers
-------
None
DNS Entries
-----------
None
Documentation
-------------
Modify documentation for formpost middleware.
Security
--------
None
Testing
-------
Tests should be added to the existing test module.
Dependencies
============
None

::
This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode
=====================================================
Improve Erasure Coding Efficiency for Global Cluster
=====================================================
This SPEC describes an efficiency improvement for Global Clusters with Erasure
Coding. It proposes a way to improve PUT/GET performance for Erasure Coding
across more than one region while still preserving the original data even if
an entire region is lost.
Problem description
===================
Swift now supports Erasure Codes (EC), which ensure higher durability and
lower disk cost than replication for a one-region cluster. However, if Swift
were currently run with EC over 2 regions using < 2x data redundancy
(e.g. ec_k=10, ec_m=4) and one of the regions were then lost for some
unfortunate reason (e.g. a huge earthquake, fire, or tsunami), there is a
chance data would be lost. That is because, assuming each region has an even
share of the available disk volume, each region would hold only around 7
fragments - fewer than ec_k - which is not enough data for the EC scheme to
rebuild the original object.
To protect stored data and to ensure higher durability, Swift has to keep at
least one full copy's worth of data in each region (i.e. >= 2x total for 2
regions) by employing a larger ec_m, such as ec_m=14 for ec_k=10. However,
this increase sacrifices encode performance. In my measurements running
PyECLib encode/decode on an Intel Xeon E5-2630v3 [1], the benchmark results
were as follows:
+----------------+----+----+---------+---------+
|scheme |ec_k|ec_m|encode |decode |
+================+====+====+=========+=========+
|jerasure_rs_vand|10 |4 |7.6Gbps |12.21Gbps|
+----------------+----+----+---------+---------+
| |10 |14 |2.67Gbps |12.27Gbps|
+----------------+----+----+---------+---------+
| |20 |4 |7.6Gbps |12.87Gbps|
+----------------+----+----+---------+---------+
| |20 |24 |1.6Gbps |12.37Gbps|
+----------------+----+----+---------+---------+
|isa_lrs_vand |10 |4 |14.27Gbps|18.4Gbps |
+----------------+----+----+---------+---------+
| |10 |14 |6.53Gbps |18.46Gbps|
+----------------+----+----+---------+---------+
| |20 |4 |15.33Gbps|18.12Gbps|
+----------------+----+----+---------+---------+
| |20 |24 |4.8Gbps |18.66Gbps|
+----------------+----+----+---------+---------+
Note that "decode" uses (ec_k + ec_m) - 2 fragments, so its performance
degrades less than encode performance, as the results above show.
Comparing ec_k=10, ec_m=4 against ec_k=10, ec_m=14, encode performance drops
to roughly one third, and the other schemes follow a similar trend. This
demonstrates that there is a problem when building an EC cluster spanning 2 or
more regions.
1: http://ark.intel.com/ja/products/83356/Intel-Xeon-Processor-E5-2630-v3-20M-Cache-2_40-GHz
Proposed change
===============
Add an option like "duplication_factor". Which will create duplicated (copied)
fragments instead of employing a larger ec_m.
For example, with a duplication_factor=2, Swift will encode ec_k=10, ec_m=4 and
store 28 fragments (10x2 data fragments and 4x2 parity fragments) in Swift.
This requires a change to the PUT/GET and reconstruct sequences to map from
the fragment index in Swift to the actual fragment index for PyECLib, but IMHO
it should not require much modification to the conversation among
proxy-server <-> object-server <-> disks.
I don't want to describe the implementation in detail in this first patch of
the spec, because the spec should first be discussed as an idea to improve
Swift. More discussion of the implementation will follow in subsequent
patches.
Considerations of actual placement
----------------------------------
The placement of these duplicated fragments is important. If the same
fragments, original and copy, land in the same region and the other region
fails, then we are in the same situation as with the smaller parity count:
unable to rebuild the original object.
e.g:
- duplication_factor=2, k=4, m=2
- 1st Region: [0, 1, 2, 6, 7, 8]
- 2nd Region: [3, 4, 5, 9, 10, 11]
- (Assuming actual indices to rebuild are mapped as index % (k+m))
In this case, the 1st region has only fragments with actual fragment indices
0, 1, 2 and the 2nd has only 3, 4, 5. Therefore, neither region alone can
rebuild the original object, because the number of unique fragments in each
region is less than k. A worst-case scenario like this causes the same
significant data loss as having no duplication factor at all.
That is, data durability will rank as
- "no duplication" < "with duplication" < "more unique parities"
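A small sketch of the index mapping used in the example above. The modulo rule
is an assumption consistent with treating Swift fragment indices 6, 7, 8 as
duplicates of actual fragment indices 0, 1, 2:

```python
def actual_fragment_index(swift_index, k, m):
    # duplicate copies reuse the same PyECLib fragment index
    return swift_index % (k + m)

k, m = 4, 2
region1 = [0, 1, 2, 6, 7, 8]
unique = {actual_fragment_index(i, k, m) for i in region1}
assert unique == {0, 1, 2}
assert len(unique) < k  # region1 alone cannot rebuild the object
```

This makes the failure mode concrete: the region holds six fragments, but only
three unique ones, which is fewer than the k=4 needed to decode.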
In future work, we could find a way to tie a fragment index to a region,
something like "the 1st subset should be in the 1st region and the 2nd subset
should be ...", but for now this is beyond the scope of this spec.
Alternatives
------------
We could use container-sync as a solution to the problem rather than
employing the proposed change.
This section describes the pros and cons of the "proposed change" versus
"container-sync".
Proposed Change
^^^^^^^^^^^^^^^
Pros:
- Higher performance way to spread objects across regions (No need to re-decode/encode for transferring across regions)
- No extra configuration other than storage policy is needed for users to turn on the global replication. (strictly global erasure coding?)
- Able to use other global cluster efficiency improvements (affinity control)
Cons:
- Need to employ more complex handling around ECObjecController
Container-Sync
^^^^^^^^^^^^^^
Pros:
- Simple and able to reuse existing swift mechanisms
- Less data transfer between regions
Cons:
- Re-decode/encode is required when transferring objects to another region
- Need to set the sync option for each container
- Impossible to retrieve/reconstruct an object when > ec_m disks unavailable (includes ip unreachable)
Implementation
==============
- Proxy-Server PUT/GET path
- Object-Reconstructor
- (Optional) Ring placement strategy
Questions and Answers
=====================
- TBD
Assignee(s)
-----------
Primary assignee:
kota\_ (Kota Tsuyuzaki)
Work Items
----------
Develop codes around proxy-server and object-reconstructor
Repositories
------------
None
Servers
-------
None
DNS Entries
-----------
None
Dependencies
============
None

::
This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode
===============================
Increasing ring partition power
===============================
This document describes a process and modifications to swift code that
together enable ring partition power to be increased without cluster downtime.
Swift operators sometimes pick a ring partition power when deploying swift
and later wish to change the partition power:
#. The operator chooses a partition power that proves to be too small and
subsequently constrains their ability to rebalance a growing cluster.
#. Perhaps more likely, in an attempt to avoid the above problem, operators
choose a partition power that proves to be unnecessarily large and would
subsequently like to reduce it.
This proposal directly addresses the first problem by enabling partition power
to be increased. Although it does not directly address the second problem
(i.e. it does not enable ring power reduction), it does indirectly help to
avoid that problem by removing the motivation to choose large partition power
when first deploying a cluster.
Problem Description
===================
The ring power determines the partition to which a resource (account, container
or object) is mapped. The partition is included in the path under which the
resource is stored in a backend filesystem. Changing the partition power
therefore requires relocating resources to new paths in backend filesystems.
In a heavily populated cluster a relocation process could be time-consuming and
so to avoid down-time it is desirable to relocate resources while the cluster
is still operating. However, it is necessary to do so without (temporary) loss
of access to data and without compromising the performance of processes such as
replication and auditing.
Proposed Change
===============
Overview
--------
The proposed solution avoids copying any file contents during a partition power
change. Objects are 'moved' from their current partition to a new partition,
but the current and new partitions are arranged to be on the same device, so
the 'move' is achieved using filesystem links without copying data.
(It may well be that the motivation for increasing partition power is to allow
a rebalancing of the ring. Any rebalancing would occur after the partition
power increase has completed - during partition power changes the ring balance
is not changed.)
To allow the cluster to continue operating during a partition power change (in
particular, to avoid any disruption or incorrect behavior of the replicator and
auditor processes), new partition directories are created in a separate
filesystem branch from the current partition directories. When all new
partition directories have been populated, the ring transitions to using the
new filesystem branch.
During this transition, object servers maintain links to resource files from
both the current and new partition directories. However, as already discussed,
no file content is duplicated or copied. The old partition directories are
eventually deleted.
Detailed description
--------------------
The process of changing a ring's partition power comprises three phases:
1. Preparation - during this phase the current partition directories continue
to be used but existing resources are also linked to new partition
directories in anticipation of the new ring partition power.
2. Switchover - during this phase the ring transitions to using the new
partition directories; proxy and backend servers rollover to using the new
ring partition power.
3. Cleanup - once all servers are using the new ring partition power,
resource files in old partition directories are removed.
For simplicity, we describe the details of each phase in terms of an object
ring but note that the same process can be applied to account and container
rings and servers.
Preparation phase
^^^^^^^^^^^^^^^^^
During the preparation phase two new attributes are set in the ring file:
* the ring's `epoch`: if not already set, a new `epoch` attribute is added to
the ring. The ring epoch is used to determine the parent directory for
partition directories. Similar to the way in which a ring's policy index is
appended to the `objects` directory name, the epoch will be prefixed to the
`objects` directory name. For simplicity, the ring epoch will be a
monotonically increasing integer starting at 0. A 'legacy' ring having no
epoch attribute will be treated as having epoch 0.
* the `next_part_power` attribute indicates the partition power that will be
used in the next epoch of the ring. The `next_part_power` attribute is used
during the preparation phase to determine the partition directory in which
an object should be stored in the next epoch of the ring.
At this point in time no other changes are made to the ring file:
the current part power and the mapping of partitions to devices are unchanged.
The updated ring file is distributed to all servers. During this preparation
phase, proxy servers will continue to use the current ring partition mapping to
determine the backend url for objects. Object servers, along with replicator
and auditor processes, also continue to use the current ring
parameters. However, during PUT and DELETE operations object servers will
create additional links to object files in the object's future partition
directory in preparation for an eventual switchover to the ring's next
epoch. This does not require any additional copying or writing of object
contents.
The filesystem path for future partition directories is determined as follows.
In general, the path to an object file on an object server's filesystem has the
form::
dev/[<epoch>-]objects[-<policy>]/<partition>/<suffix>/<hash>/<ts>.<ext>
where:
* `epoch` is the ring's epoch, if non-zero
* `policy` is the object container's policy index, if non-zero
* `dev` is the device to which `partition` is mapped by the ring file
* `partition` is the object's partition,
calculated using `partition = F(hash) >> (32 - P)`,
where `P` is the ring partition power
* `suffix` is the last three digits of `hash`
* `hash` is a hash of the object name
* `ts` is the object timestamp
* `ext` is the filename extension (`data`, `meta` or `ts`)
Given `next_part_power` and `epoch` in the ring file, it is possible to
calculate::
future_partition = F(hash) >> (32 - next_part_power)
next_epoch = epoch + 1
The future partition directory is then::
dev/<next_epoch>-objects[-<policy>]/<next_partition>/<suffix>/<hash>/<ts>.<ext>
For example, consider a ring in its first epoch, with current partition power
P, containing an object currently in partition X, where 0 <= X < 2**P. If the
partition power increases by one (doubling the number of partitions), the
object's future partition will be either 2X or 2X+1 in the ring's next epoch.
During a DELETE an additional filesystem link will be created at one of::
dev/1-objects/<2X>/<suffix>/<hash>/<ts>.ts
dev/1-objects/<2X+1>/<suffix>/<hash>/<ts>.ts
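The partition arithmetic above can be sketched as follows. Here md5 stands in
for the name hash F; real Swift also mixes a cluster-wide hash path
prefix/suffix into the hash, which is omitted in this illustration:

```python
import hashlib

def partition(name, part_power):
    # top 32 bits of the name hash, shifted down to part_power bits
    h = int(hashlib.md5(name.encode()).hexdigest()[:8], 16)
    return h >> (32 - part_power)

P = 10
current = partition('a/c/o', P)
future = partition('a/c/o', P + 1)
# increasing the power by one maps partition X to either 2X or 2X+1
assert future in (2 * current, 2 * current + 1)
```

Because the future partition is fully determined by the current one, the
relinker and the object server can each compute it independently and agree on
the same target directory.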
Once object servers are known to be using the updated ring file a new relinker
process is started. The relinker prepares an object server's filesystem for a
partition power change by crawling the filesystem and linking existing objects
to future partition directories. The relinker determines each object's future
partition directory in the same way as described above for the object server.
The relinker does not remove links from current partition directories. Once the
relinker has successfully completed, every existing object should be linked
from both a current partition directory and a future partition directory. Any
subsequent object PUTs or DELETEs will be reflected in both the current and
future partition directory as described above.
To avoid newly created objects being 'lost', it is important that an object
server is using the updated ring file before the relinker process starts in
order to guarantee that either the object server or the relinker create future
partition links for every object. This may require object servers to be
restarted prior to the relinker process being started, or to otherwise report
that they have reloaded the ring file.
The relinker will report successful completion in a file
`/var/cache/swift/relinker.recon` that can be queried via (modified) recon
middleware.
Once the relinker process has successfully completed on all object servers, the
partition power change process may move on to the switchover phase.
Switchover phase
^^^^^^^^^^^^^^^^
To begin the switchover to using the next partition power, the ring file is
updated once more:
* the current partition power is stored as `previous_part_power`
* the current partition power is set to `next_partition_power`
* `next_partition_power` is set to None
* the ring's `epoch` is incremented
* the mapping of partitions to devices is re-created so that partitions 2X and
2X+1 map to the same devices to which partition X was mapped in the previous
epoch. This is a simple transformation. Since no object content is moved
between devices the actual ring balance remains unchanged.
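The remapping step can be sketched like this; `replica2part2dev` mirrors the
ring's internal partition-to-device table, but this is a minimal illustration,
not the actual ring-builder code:

```python
def double_partitions(replica2part2dev):
    """Return a table with twice as many partitions, where new
    partitions 2X and 2X+1 map to old partition X's device."""
    return [[row[p // 2] for p in range(2 * len(row))]
            for row in replica2part2dev]

old = [[3, 7, 3, 7]]  # one replica row over 4 partitions
assert double_partitions(old) == [[3, 3, 7, 7, 3, 3, 7, 7]]
```

Every device keeps exactly the same share of partitions it had before, which
is why no data moves between devices and the ring balance is unchanged.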
The updated ring file is then distributed to all proxy and object servers.
Since ring file distribution and loading is not instantaneous, there is a
window of time during which a proxy server may direct object requests to either
an old partition or a current partition (note that the partitions previously
referred to as 'future' are now referred to as 'current'). Object servers will
therefore create additional filesystem links during PUT and DELETE requests,
pointing from old partition directories to files in the current partition
directories. The paths to the old partition directories are determined in the
same way as future partition directories were determined during the preparation
phase, but now using the `previous_part_power` and decrementing the current
ring `epoch`.
This means that if one proxy PUTs an object using a current partition, then
another proxy subsequently attempts to GET the object using the old partition,
the object will be found, since both current and old partitions map to the same
device. Similarly if one proxy PUTs an object using the old partition and
another proxy then GETs the object using the current partition, the object will
be found in the current partition on the object server.
The object auditor and replicator processes are restarted to force reloading of
the ring file and commence to operate using the current ring parameters.
Cleanup phase
^^^^^^^^^^^^^
The cleanup phase may start once all servers are known to be using the updated
ring file. Once again, this may require servers to be restarted or to report
that they have reloaded the ring file during switchover.
A final update is made to the ring file: the `previous_partition_power`
attribute is set to `None` and the ring file is once again distributed. Once
object servers have reloaded the updated ring file they will cease to create
object file links in old partition directories.
At this point the old partition directories may be deleted - there is no need
to create tombstone files when deleting objects in the old partitions since
these partition directories are no longer used by any swift process.
A cleanup process will crawl the filesystem and delete any partition
directories that are not part of the current epoch or a future epoch. This
cleanup process should repeat periodically in case any devices that were
offline during the partition power change come back online - the old epoch
partition directories discovered on those devices may be deleted. Normal
replication may cause current epoch partition directories to be created on a
resurrected disk.
(The cleanup function could be added to an existing process such as the
auditor).
Other considerations
--------------------
swift-dispersion-[populate|report]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The swift-dispersion-[populate|report] tools will need to be made epoch-aware.
After increasing partition power, swift-dispersion-populate may need to be
run to achieve the desired coverage. (Although initially the device coverage
will remain unchanged, the percentage of partitions covered will have reduced
by whatever factor the partition power has increased.)
Auditing
^^^^^^^^
During preparation and switchover, the auditor may find a corrupt object. The
quarantine directory is not in the epoch partition directory filesystem branch,
so a quarantined object will not be lost when old partitions are deleted.
The quarantining of an object in a current partition directory will not remove
the object from a future partition, so after switchover the auditor will
discover the object again, and quarantine it again. The diskfile quarantine
renamer could optionally be made 'relinker' aware and unlink duplicate object
references when quarantining an object.
Alternatives
------------
Prior work
^^^^^^^^^^
The swift_ring_tool_ enables ring power increases while swift services are
disabled. It takes a similar approach to this proposal in that the ring
mapping is changed so that every resource remains on the same device when
moved to its new partition. However, new partitions are created in the
same filesystem branch as existing (hence the need for services to be suspended
during the relocation).
.. _swift_ring_tool: https://github.com/enovance/swift-ring-tool/
Previous proposals have been made to upstream swift:
https://bugs.launchpad.net/swift/+bug/933803 suggests a 'same-device'
partition re-mapping, as does this proposal, but did not provide for
relocation of resources to new partition directories.
https://review.openstack.org/#/c/21888/ suggests maintaining a partition power
per device (so only new devices use the increased partition power) but appears
to have been abandoned due to complexities with replication.
Create future partitions in existing `objects[-policy]` directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The duplication of filesystem entries for objects and creation of (potentially
duplicate) partitions during the preparation phase could have undesirable
effects on other backend processes if they are not isolated in another
filesystem branch.
For example, the object replicator is likely to discover newly created future
partition directories that appear to be 'misplaced'. The replicator will
attempt to sync these to their primary nodes (according to the old ring
mapping) which is unnecessary. Worse, the replicator might then delete the
future partitions from their current nodes, undoing the work of the relinker
process.
If the replicator were to adopt the future ring mappings from the outset of the
preparation phase then the same problems arise with respect to current
partitions that now appear to be misplaced. Furthermore, the replication
process is likely to race with the relinker process on remote nodes to
populate future partitions: if relocation proceeds faster on node A than B then
the replicator may start to sync objects from A to B, which is again
unnecessary and expensive.
The auditor will also be impacted as it will discover objects in the future
partition directories and audit them, being unable to distinguish them as
duplicates of the object still stored in the current partition.
These issues could of course be avoided by disabling replication and auditing
during the preparation phase, but instead we propose to make the future ring
partition naming be mutually exclusive from current ring partition naming, and
simply restrict the replicator and auditor to only process partitions that are
in the current ring partition set. In other words we isolate these processes
from the future partition directories that are being created by the relinker.
Use mutually exclusive future partitions in existing `objects` directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The current algorithm for calculating the partition for an object is to
calculate a 32 bit hash of the object and then use its P most significant bits,
resulting in partitions in the range {0, 2**P - 1}. i.e.::
part = H(object name) >> (32 - P)
A ring with partition power P+1 will re-use all the partition numbers of a ring
with partition power P.
To eliminate overlap of future ring partitions with current ring partitions we
could change the partition number algorithm to add an offset to each partition
number when a ring's partition power is increased::
offset = 2**P
part = (H(object name) >> (32 - P)) + offset
This is backwards compatible: if `offset` is not defined in a ring file then it
is set to zero.
To ensure that partition numbers remain < 2**32, this change will reduce the
maximum partition power from 32 to 31.
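A sketch of the offset scheme follows; this is one reading of the proposal,
and in particular treating the offset as 2**P for an increase from power P to
P + 1 is an assumption:

```python
def partition_with_offset(h, part_power, offset):
    # h is the 32-bit name hash, as elsewhere in this spec
    return (h >> (32 - part_power)) + offset

P = 10
# old ring: power P, offset 0  -> partitions in [0, 2**P)
# new ring: power P + 1, offset 2**P -> partitions start at 2**P,
# so the two partition ranges never overlap
old_max = partition_with_offset(0xFFFFFFFF, P, 0)
new_min = partition_with_offset(0, P + 1, 2 ** P)
assert old_max == 2 ** P - 1
assert new_min == 2 ** P
```

The disjoint ranges are what let the replicator and auditor skip future
partition directories simply by checking partition-number membership.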
Proxy servers start to use the new ring at outset of relocation phase
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This would mean that GETs to backends would use the new rings partitions in
object urls. Objects may not yet have been relocated to their new partition
directory and the object servers would therefore need to fall back to looking
in the old ring partition for the object. PUTs and DELETEs to the new partition
would need to be made conditional upon a newer object timestamp not existing in
the old location. This is more complicated than the proposed method.
Enable partition power reduction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Ring power reduction is not easily achieved with the approach presented in this
proposal because there is no guarantee that partitions in the current epoch
that will be merged into partitions in the next epoch are located on the same
device. File contents are therefore likely to need copying between devices
during a preparation phase.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
alistair.coles@hp.com
Work Items
----------
#. modify ring classes to support new attributes
#. modify ringbuilder to manage new attributes
#. modify backend servers to duplicate links to files in future epoch partition
directories
#. make backend servers and the relinker report their status in a way that
   recon can report, e.g. servers report when a new ring epoch has been
   loaded, and the relinker reports when all relinking has been completed.
#. make recon support reporting these states
#. modify code that assumes storage-directory is objects[-policy_index] to
be aware of epoch prefix
#. make swift-dispersion-populate and swift-dispersion-report epoch-aware
#. implement relinker daemon
#. document process
Repositories
------------
No new git repositories will be created.
Servers
-------
No new servers are created.
DNS Entries
-----------
No DNS entries will need to be created or updated.
Documentation
-------------
Process will be documented in the administrator's guide. Additions will be made
to the ring-builder documents.
Security
--------
No security issues are foreseen.
Testing
-------
Unit tests will be added for changes to ring-builder, ring classes and
object server.
Probe tests will be needed to verify the process of increasing ring power.
Functional tests will be unchanged.
Dependencies
============
None

::
This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode
..
==============================================
Send notifications on PUT/POST/DELETE requests
==============================================
Swift should be able to send out notifications if new objects are uploaded,
metadata has been changed or data has been deleted.
Problem Description
===================
Currently there is no way to detect changes in a given container except by
listing its contents and comparing timestamps. This is slow and inefficient
when a container stores a large number of objects.
Some external services might be interested when an object got uploaded, updated
or deleted; for example to store the metadata in an external database for
searching or to trigger specific events like computing on object data.
Proposed Change
===============
A new middleware should be added that can be configured to run inside the proxy
server pipeline.
Alternatives
------------
Another option might be to analyze and parse logfiles, aggregating data into
notifications per account and sending batches of updates to an external
service. However, performance would most likely be worse since a lot of string
parsing is involved, and a central logging service might be required to send
notifications in order.
Implementation
==============
Sending out notifications should happen when an object is modified. That means
every successful object change (PUT, POST, DELETE) should trigger an action and
send out an event notification.
It should be configurable on either an account or container level that
notifications should be sent; this leaves it up to the user to decide where they
end up and if a possible performance impact is acceptable.
An implementation should be developed as an additional middleware inside the Swift
proxy, and make use of existing queuing implementations within OpenStack,
namely Zaqar (https://wiki.openstack.org/wiki/Zaqar).
It needs to be discussed whether metadata that is stored along with the object
should be included in the notification or not; if there is a lot of metadata
the notifications get quite large. A possible trade-off might be a threshold
for included metadata, for example only the first X bytes, or sending no
metadata at all, but only the account/container/object name.
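A minimal WSGI middleware sketch of the idea (class and payload field names are hypothetical; the real implementation would hand the payload to a Zaqar queue, which is stubbed out here as an injectable ``notify`` callable):

```python
import time


class NotificationMiddleware(object):
    """Send a notification for each successful object PUT/POST/DELETE."""

    def __init__(self, app, notify=None):
        self.app = app
        # 'notify' stands in for a Zaqar client post; injectable for tests.
        self.notify = notify or (lambda payload: None)

    def __call__(self, environ, start_response):
        method = environ['REQUEST_METHOD']
        path = environ['PATH_INFO']
        captured = {}

        def capture(status, headers, exc_info=None):
            # Remember the status code so we only notify on success.
            captured['status'] = int(status.split(' ', 1)[0])
            return start_response(status, headers, exc_info)

        resp = self.app(environ, capture)
        # Only object paths (/v1/account/container/object) and successful
        # modifying requests trigger a notification.
        parts = path.lstrip('/').split('/')
        if (method in ('PUT', 'POST', 'DELETE')
                and len(parts) >= 4 and 200 <= captured['status'] < 300):
            self.notify({
                'account': parts[1], 'container': parts[2],
                'object': '/'.join(parts[3:]),
                'method': method, 'timestamp': time.time(),
            })
        return resp
```

The per-account/per-container enablement check and the Zaqar connection details are deliberately omitted; this only shows where in the request path the notification would be emitted.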
Assignee(s)
-----------
Primary assignee:
cschwede
Work Items
----------
Develop middleware for the Swift proxy server including functional tests.
Update Swift functional test VMs to include Zaqar service for testing.
Repositories
------------
None
Servers
-------
Functional tests require either a running Zaqar service on the testing VM, or a
dummy implementation that acts like a Zaqar queue.
DNS Entries
-----------
None
Documentation
-------------
Add documentation for new middleware
Security
--------
Notifications should be just enabled or disabled per-container, and the
receiving server should be set only in the middleware configuration setting.
This prevents users from forwarding events to an own, external side that the
operator is not aware of.
Enabling or disabling should be restricted to account owners.
Sent notifications include the account/container/objectname, thus traffic should
be transmitted over a private network or SSL-encrypted.
Testing
-------
Unit and functional testing shall be included in a patch.
Dependencies
============
- python-zaqarclient: https://github.com/openstack/python-zaqarclient
- zaqar service running on the gate (inside the VM)

::
This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode
..
=====================================
tempurls with a prefix-based scope
=====================================
The tempurl middleware should be allowed to use a prefix-based signature, which
grants access to all objects with a specific prefix. This allows access to a
whole container or pseudofolder with one signature, instead of requiring a new
signature for each object.
Problem Description
===================
At the moment, if one wants to share a large number of objects inside a
container/pseudofolder with external people, one has to create a temporary url
for each object. Additionally, objects which are placed inside the
container/pseudofolder after the generation of the signature cannot be
accessed with the same signature.
Prefix-based signatures would allow reusing the same signature for a large
number of objects which share the same prefix.
Use cases:
1. We have one pseudofolder with 1000000 objects. We want to share this pseudofolder with external
partners. Instead of generating 1000000 different signatures, we only need to generate one
signature.
2. We have a web-based application on top of Swift, like the swiftbrowser
(https://github.com/cschwede/django-swiftbrowser), which acts as a file
browser. We want to support the sharing of temporary pseudofolders with
external people. We do not know in advance which and how many objects will
live inside the pseudofolder.
With prefix-based signatures, we could develop the web application so that the
user could generate a temporary url for one pseudofolder, which could be used
by external people to access all objects which will live inside it
(this use-case additionally needs a temporary container listing, to display
which objects live inside the pseudofolder, and a modification of the formpost
middleware; please see spec https://review.openstack.org/#/c/225059/).
Proposed Change
===============
The temporary url middleware should be changed; the code change should not be
too big. If clients desire to use a prefix-based signature, they can append a
URL parameter "temp_url_prefix" with the desired prefix (an empty prefix would
specify the whole container), and the middleware would only use the container
path + prefix for calculating the signature. Furthermore, the middleware would
check that the object path really starts with this prefix.
Let's look at two examples. In the first example, we want to allow a user to
upload a bunch of objects in a container c.
The user first creates a tempurl, for example using the swift command line
tool (a modified version which supports tempurls with container-level scope):
::
$swift tempurl --container-level PUT 86400 /v1/AUTH_account/c/ KEY
/v1/AUTH_account/c/?temp_url_sig=9dd9e9c318a29c6212b01343a2d9f9a4c9deef2d&temp_url_expires=1439280760&temp_url_prefix=
The user then uploads a bunch of files, each time using the same
container-level signature:
::
$curl -XPUT --data-binary @file1 https://example.host/v1/AUTH_account/c/o1?temp_url_sig=9dd9e9c318a29c6212b01343a2d9f9a4c9deef2d&temp_url_expires=1439280760&temp_url_prefix=
$curl -XPUT --data-binary @file2 https://example.host/v1/AUTH_account/c/o2?temp_url_sig=9dd9e9c318a29c6212b01343a2d9f9a4c9deef2d&temp_url_expires=1439280760&temp_url_prefix=
$curl -XPUT --data-binary @file3 https://example.host/v1/AUTH_account/c/p/o3?temp_url_sig=9dd9e9c318a29c6212b01343a2d9f9a4c9deef2d&temp_url_expires=1439280760&temp_url_prefix=
In the next example, we want to allow an external user to download a whole pseudofolder p:
::
$swift tempurl --container-level GET 86400 /v1/AUTH_account/c/p KEY
/v1/AUTH_account/c/p?temp_url_sig=4e755839d19762e06c12d807eccf46ff3224cb3f&temp_url_expires=1439281346&temp_url_prefix=p
$curl https://example.host/v1/AUTH_account/c/p/o1?temp_url_sig=4e755839d19762e06c12d807eccf46ff3224cb3f&temp_url_expires=1439281346&temp_url_prefix=p
$curl https://example.host/v1/AUTH_account/c/p/o2?temp_url_sig=4e755839d19762e06c12d807eccf46ff3224cb3f&temp_url_expires=1439281346&temp_url_prefix=p
$curl https://example.host/v1/AUTH_account/c/p/p2/o3?temp_url_sig=4e755839d19762e06c12d807eccf46ff3224cb3f&temp_url_expires=1439281346&temp_url_prefix=p
The following requests would be denied because of a missing or wrong prefix:
::
$curl https://example.host/v1/AUTH_account/c/o4?temp_url_sig=4e755839d19762e06c12d807eccf46ff3224cb3f&temp_url_expires=1439281346&temp_url_prefix=p
$curl https://example.host/v1/AUTH_account/c/p3/o5?temp_url_sig=4e755839d19762e06c12d807eccf46ff3224cb3f&temp_url_expires=1439281346&temp_url_prefix=p
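The signature calculation behind these examples could look roughly like the sketch below. It reuses the hmac-sha1 scheme of the existing tempurl middleware; the exact message format for the prefix variant is this spec's proposal, and the helper names are illustrative:

```python
import hmac
from hashlib import sha1


def prefix_sig(key, method, expires, path_prefix):
    # path_prefix is the container path plus the prefix,
    # e.g. '/v1/AUTH_account/c/p', or '/v1/AUTH_account/c/' for a
    # whole-container signature (empty prefix).
    msg = '%s\n%d\n%s' % (method, expires, path_prefix)
    return hmac.new(key, msg.encode('utf-8'), sha1).hexdigest()


def is_valid(key, method, expires, object_path, prefix, sig):
    # Reconstruct the signed path from the requested object path.
    parts = object_path.split('/', 4)  # ['', 'v1', acct, cont, obj]
    signed_path = '/'.join(parts[:4]) + '/' + prefix
    expected = prefix_sig(key, method, expires, signed_path)
    # Both the signature and the prefix itself must match. Note that a
    # plain startswith() treats the prefix as a raw string prefix, so
    # prefix 'p' would also match 'p3/...'; a real implementation has
    # to decide whether matching is string- or path-component-based.
    return (hmac.compare_digest(expected, sig)
            and object_path.startswith(signed_path))
```

Because only the container path and prefix enter the hmac, every object under that prefix validates against the same signature, which is exactly the reuse property the examples above rely on.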
Alternatives
------------
A new middleware could be introduced, but that would mostly lead to code
duplication, as the changes are small in comparison to the original
middleware.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
bartz
Work Items
----------
Add modifications to tempurl and respective test module.
Repositories
------------
None
Servers
-------
None
DNS Entries
-----------
None
Documentation
-------------
Modify documentation for tempurl middleware.
Security
--------
The calculation of the signature uses the hmac module (https://docs.python.org/2/library/hmac.html)
in combination with the sha1 hash function.
The difference between a prefix-based signature and the current
object-path-based signature is that the path is shrunk to the prefix; the
remaining part of the calculation stays the same.
A shorter path induces a shorter message as input to the hmac calculation,
which should not reduce the cryptographic strength. Therefore, I do not see
security-related problems with introducing a prefix-based signature.
Testing
-------
Tests should be added to the existing test module.
Dependencies
============
None

::
This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode
..
This template should be in ReSTructured text. Please do not delete
any of the sections in this template. If you have nothing to say
for a whole section, just write: "None". For help with syntax, see
http://sphinx-doc.org/rest.html To test out your formatting, see
http://www.tele3.cz/jbar/rest/rest.html
==================================================
Swift Request Tagging for detailed logging/tracing
==================================================
URL of your blueprint:
None.
Tag a particular request (or every 'x'-th request) so that it undergoes more
detailed logging.
Problem Description
===================
Reasons for detailed logging:

- A Swift user is having problems which we cannot recreate, but we could tag
  this user's requests for more logging.
- To better investigate a cluster for bottlenecks/problems.
- An internal user (admin/op) wants additional info in situations where the
  client is getting inconsistent container listings. With the Swift-inspector,
  we can tell which node is not returning the correct listings.
Proposed Change
===============
Existing: Swift-Inspector (https://github.com/hurricanerix/swift-inspector)
currently provides middleware in the proxy and object servers. It relays info
about a request back to the client, with the assumption that the client is
actively making a decision to tag a request to trigger some action that would
not otherwise occur.
Current Inspectors:

- Timing: Inspector-Timing gives the amount of time it took for the
  proxy-server to process the request.
- Handlers: Inspector-Handlers is not implemented (meant to return the
  account/container/object servers that were contacted in the request);
  Inspector-Handlers-Proxy returns the proxy that handled the request.
- Nodes: Inspector-Nodes returns which account/container/object servers the
  path resides on; Inspector-More-Nodes returns extra nodes for handoff.
Changes:

- Add a logging inspector to the above inspectors, which would enable detailed
  logging for tagged requests.
- Add the capability to let the system decide (instead of the client) to tag a
  request, plus rules to trigger actions like extra logging.

Possible tagging criteria:

- every 'x' requests / a percentage of all requests;
- based on something in the request/response headers (e.g. if the HTTP method
  is DELETE, or the response is sending a specific status code back);
- based on a specific account/container/object/feature.
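The tagging criteria above could be expressed as a small rule helper along these lines (a sketch with hypothetical names, not part of swift-inspector):

```python
import itertools
import random


class RequestTagger(object):
    """Decide whether a request should get detailed logging.

    Rules are illustrative: every n-th request, a percentage sample,
    or specific HTTP methods.
    """

    def __init__(self, every_nth=0, sample_pct=0.0, methods=()):
        self.every_nth = every_nth
        self.sample_pct = sample_pct
        self.methods = set(methods)
        self._counter = itertools.count(1)

    def should_tag(self, method):
        n = next(self._counter)
        # Rule 1: every n-th request.
        if self.every_nth and n % self.every_nth == 0:
            return True
        # Rule 2: specific HTTP methods (e.g. DELETE).
        if method in self.methods:
            return True
        # Rule 3: random percentage sample of all requests.
        if self.sample_pct and random.random() * 100 < self.sample_pct:
            return True
        return False
```

Middleware would consult ``should_tag()`` per request and, when it returns True, attach the tag header that triggers detailed logging downstream.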
Alternatives
------------
- Logging: log collector/log aggregator like logstash.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
https://launchpad.net/~shashirekha-j-gundur
Work Items
----------
- Add a logging inspector to the existing inspectors, to enable the logs.
- Add rules to decide which requests should be tagged.
- Trigger actions like logging.
- Restrict access to the nodes/inventory list displayed to admins/ops only.
- Figure out how hmac_key access (Inspector-Sig) and logging work together.
Repositories
------------
Will any new git repositories need to be created? Yes.
Servers
-------
Will any new servers need to be created? No.
What existing servers will be affected? Proxy and Object servers.
DNS Entries
-----------
Will any other DNS entries need to be created or updated? No.
Documentation
-------------
Will this require a documentation change? Yes, Swift-inspector docs.
Will it impact developer workflow? No.
Will additional communication need to be made? No.
Security
--------
None.
Testing
-------
Unit tests.
Dependencies
============
- Swift-Inspector https://github.com/hurricanerix/swift-inspector
- Does it require a new puppet module? No.

::
This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode
===============================
PACO Single Process deployments
===============================
Since the release of the DiskFile API, there have been a number of different
implementations providing the ability to store Swift objects in third-party
storage systems. Commonly these systems provide the durability and
availability of the objects (e.g., GlusterFS, GPFS), thus requiring the object
ring to be created with only one replica.
A typical deployment style for this configuration is a "PACO" deployment,
where the proxy, account, container and object services are running on the same
node. The object ring is built in such a way that the proxy server always
sends requests to the local object server. The object server (with its
third-party DiskFile) is then responsible for writing the data to the
underlying storage system, which will then distribute the data according to
its own policies.
Problem description
===================
In a typical swift deployment, proxy nodes send data to object
servers running on different nodes and the object servers write the data
directly to disk. In the case of third-party storage systems, the object server
typically makes another network connection to send the object to that storage
system, adding some latency to the data path.
Even when the proxy and object servers are on the same node, latency is still
introduced due to RPC communication over local network.
Proposed change
===============
For the scenario of single replica - PACO deployments, the proxy server would
be sending data directly to the third-party storage systems. To accomplish this
we would like to call the object wsgi application directly from
the proxy process instead of making the additional network connection.
This proposed solution focuses on reducing the proxy-to-object-server latency.
Proxy-to-account and proxy-to-container communications would stay the same for
now and be addressed in a later patch.
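The idea of calling the object WSGI application directly can be sketched as follows (hypothetical helper names; in a real deployment the object app would be loaded from ``/etc/swift/single-process.conf``, e.g. via PasteDeploy, rather than passed in):

```python
import io


class InProcessObjectClient(object):
    """Invoke an object server's WSGI app without a network hop."""

    def __init__(self, object_app):
        self.object_app = object_app

    def request(self, method, path, headers=None, body=b''):
        # Build a minimal WSGI environ in place of an HTTP connection.
        environ = {
            'REQUEST_METHOD': method,
            'PATH_INFO': path,
            'CONTENT_LENGTH': str(len(body)),
            'wsgi.input': io.BytesIO(body),
        }
        for name, value in (headers or {}).items():
            environ['HTTP_' + name.upper().replace('-', '_')] = value
        captured = {}

        def start_response(status, resp_headers, exc_info=None):
            captured['status'] = status
            captured['headers'] = resp_headers

        body_iter = self.object_app(environ, start_response)
        return (captured['status'], dict(captured['headers']),
                b''.join(body_iter))
```

The proxy's object controller would call ``request()`` instead of opening a backend connection, eliminating the local RPC round trip described above.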
Assignee(s)
-----------
Primary assignee:
thiago@redhat.com
Work Items
----------
A WiP patch has been submitted: https://review.openstack.org/#/c/159285/.
The work that has been done recently to the Object Controllers in the proxy
servers provides the ability for a very nice separation of the code.
TODOs and where further investigation is needed:
* How to load the object WSGI application instance in the proxy process?
* How to add support for multiple storage policies?
Prototype
---------
To test patch `159285 <https://review.openstack.org/#/c/159285/>`_ follow these
steps:
#. Create a new single-replica storage policy. Update swift.conf and create
   the new ring. The port provided during ring creation will not be used for
   anything.
#. Create an object-server config file: ``/etc/swift/single-process.conf``.
This configuration file can look like any other object-server configuration
file, just make sure it specifies the correct device the object server
should be writing to. For example, in the case of `Swift-on-File <https://github.com/stackforge/swiftonfile>`_
object server, the device is the mountpoint of the shared filesystem (e.g.,
Gluster, GPFS).
#. Start the proxy.

::
This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode
====================
Swift Symbolic Links
====================
1. Problem description
======================
With the advent of storage policies and erasure codes, moving an
object between containers is becoming increasingly useful. However, we
don't want to break existing references to the object when we do so.
For example, a common object lifecycle has the object starting life
"hot" (i.e. frequently requested) and gradually "cooling" over time
(becoming less frequently requested). The user will want an object to
start out replicated for high requests-per-second while hot, but
eventually transition to EC for lower storage cost once cold.
A completely different use case is when an application is sharding
objects across multiple containers, but finds that it needs to use
even more containers; for example, going from 256 containers up to
4096 as write rate goes up. The application could migrate to the new
schema by creating 4096-sharded references for all 256-sharded
objects, thus avoiding a lot of data movement.
Yet a third use case is a user who has large amounts of
infrequently-accessed data that is stored replicated (because it was
uploaded prior to Swift's erasure-code support) and would like to
store it erasure-coded instead. The user will probably ask for Swift
to allow storage-policy changes at the container level, but as that is
fraught with peril, we can offer them this instead.
2. Proposed change
==================
Swift will gain the notion of a symbolic link ("symlink") object. This
object will reference another object. GET, HEAD, and OPTIONS
requests for a symlink object will operate on the referenced object.
DELETE and PUT requests for a symlink object will operate on the
symlink object, not the referenced object, and will delete or
overwrite it, respectively.
GET, HEAD, and OPTIONS requests can operate on a symlink object
instead of the referenced object by adding a query parameter
``?symlink=true`` to the request.
The ideal behaviour for POSTs would be for them to apply to the referenced
object, but due to Swift's eventually-consistent nature this is not possible.
Initially, it was suggested that POSTs should apply to the symlink directly,
and during a GET or HEAD both the symlink and referenced object's headers would be
compared and the newest returned. While this would work, the behaviour can be
rather odd if an application were to ever GET or HEAD the referenced object directly
as it would not contain any of the headers posted to the symlink.
Given all of this, the best remaining choice is to fail a POST to a symlink
and let the application take care of it, namely by POSTing to the referenced
object directly. Achieving this behaviour requires several changes:
1) To avoid a HEAD on every POST, the object server will be made aware of
symlinks and can detect their presence and fail appropriately.
2) Simply failing a POST in the object server when the object is a symlink will
not work; Consider the following scenarios:
Scenario A::
- Add a symlink
T0 - PUT /accnt/cont/obj?symlink=true
- Overwrite the symlink with a regular object
T1 - PUT /accnt/cont/obj
- Assume at this point some of the primary nodes were down so handoff nodes
were used.
T2 - POST /accnt/cont/obj
- Depending on the object server hit it may see obj as either a symlink or a
regular object, though we know in time it will indeed be a real object.
Scenario B::
- Add a regular object
T0 - PUT /accnt/cont/obj
- Overwrite regular object with a symlink
T1 - PUT /accnt/cont/obj?symlink=true
- Assume at this point some of the primary nodes were down so handoff nodes
were used.
T2 - POST /accnt/cont/obj
- Depending on the object server hit it may see obj as either a symlink or a
regular object, though we know in time it will indeed be a symlink.
Given the scenarios above, at T2 (i.e. during the POST) it is possible that
some object servers see a symlink and others a regular object, so it is not
possible to simply fail the POST of a symlink. Instead, the following
behaviour will be used: the object server will always apply the POST, whether
the object is a symlink or a regular object, but we will still return an error
to the client if the object server believes it has seen a symlink. In
scenario A) this implies the POST at T2 may fail, but the update will indeed
be applied to the regular object, which is the correct behaviour. In
scenario B) this implies the POST at T2 may fail, but the update will indeed
be applied to the symlink, which while not ideal is not incorrect behaviour
per se, and the error returned to the application should cause it to apply the
POST to the referenced object, which, given the initial point raised earlier,
is indeed desirable.
The aim is for Swift symlinks to operate analogously to Unix symbolic
links (except where it does not make sense to do so).
2.1. Alternatives
-----------------
One could use a single-segment SLO manifest to achieve a similar
effect. However, the ETag of a SLO manifest is the MD5 of the ETags of
its segments, so using a single-segment SLO manifest changes the ETag
of the object. Also, object metadata (X-Object-Meta-\*) would have to
be copied to the SLO manifest since metadata from SLO segments does
not appear in the response. Further, SLO manifests contain the ETag of
the referenced segments, and if a segment changes, the manifest
becomes invalid. This is not a desirable property for symlinks.
A DLO manifest does not validate ETags, but it still fails to preserve
the referenced object's ETag and metadata, so it is also unsuitable.
Further, since DLOs are based on object name prefixes, the upload of a
new object (e.g. ``thesis.doc``, then later ``thesis.doc.old``) could
cause corrupted downloads.
Also, DLOs and SLOs cannot use each other as segments, while Swift
symlinks can reference DLOs and SLOs *and* act as segments in DLOs and
SLOs.
3. Client-facing API
====================
Clients create a Swift symlink by performing a zero-length PUT request
with the query parameter ``?symlink=true`` and the header
``X-Object-Symlink-Target-Object: <object>``.
For a cross-container symlink, also include the header
``X-Object-Symlink-Target-Container: <container>``. If omitted, it defaults to
the container of the symlink object.
For a cross-account symlink, also include the header
``X-Object-Symlink-Target-Account: <account>``. If omitted, it defaults to
the account of the symlink object.
Symlinks must be zero-byte objects. Attempting to PUT a symlink
with a nonempty request body will result in a 400-series error.
The referenced object need not exist at symlink-creation time. This
mimics the behavior of Unix symbolic links. Also, if we ever make bulk
uploads work with symbolic links in the tarballs, then we'll have to
avoid validation. ``tar`` just appends files to the archive as it
finds them; it does not push symbolic links to the back of the
archive. Thus, there's a 50% chance that any given symlink in a
tarball will precede its referent.
3.1 Example: Move an object to EC storage
-----------------------------------------
Assume the object is /v1/MY_acct/con/obj
1. Obtain an EC-storage-policy container either by finding a
pre-existing one or by making a container PUT request with the
right X-Storage-Policy header.
2. Make a COPY request to copy the object into the EC-policy
container, e.g.::
COPY /v1/MY_acct/con/obj
Destination: ec-con/obj
3. Overwrite the replicated object with a symlink object::
PUT /v1/MY_acct/con/obj?symlink=true
X-Object-Symlink-Target-Container: ec-con
X-Object-Symlink-Target-Object: obj
4. Interactions With Existing Features
======================================
4.1 COPY requests
-----------------
If you copy a symlink without ``?symlink=true``, you get a copy of the
referenced object. If you copy a symlink with ``?symlink=true``, you
get a copy of the symlink; it will refer to the same object,
container, and account.
However, if you copy a symlink without
``X-Object-Symlink-Target-Container`` between containers, or a symlink
without ``X-Object-Symlink-Target-Account`` between accounts, the new
symlink will refer to a different object.
4.2 Versioned Containers
------------------------
These will definitely interact. We should probably figure out how.
4.3 Object Expiration
---------------------
There's nothing special here. If you create the symlink with
``X-Delete-At``, the symlink will get deleted at the appropriate time.
If you use a plain POST to set ``X-Delete-At`` on a symlink, it gets
set on the referenced object just like other object metadata. If you
use POST with ``?symlink=true`` to set ``X-Delete-At`` on a symlink,
it will be set on the symlink itself.
4.4 Large Objects
-----------------
Since we'll almost certainly end up implementing symlinks as
middleware, we'll order the pipeline like this::
[pipeline:main]
pipeline = catch_errors ... slo dlo symlink ... proxy-server
This way, you can create a symlink whose target is a large object
*and* a large object can reference symlinks as segments.
This also works if we decide to implement symlinks in the proxy
server, though that would only happen if a compelling reason were
found.
4.5 User Authorization
----------------------
Authorization will be checked for both the symlink and the referenced
object. If the user is authorized to see the symlink but not the
referenced object, they'll get a 403, same as if they'd tried to
access the referenced object directly.
4.6. Quotas
-----------
Nothing special needed here. A symlink counts as 1 object toward an
object-count quota. Since symlinks are zero bytes, they do not count
toward a storage quota, and we do not need to write any code to make
that happen.
4.7 list_endpoints / Hadoop / ZeroVM
------------------------------------
If the application talks directly to the object server and fetches a
symlink, it's up to the application to deal with it. Applications that
bypass the proxy should either avoid use of symlinks or should know
how to handle them.
The same is true for SLO, DLO, versioning, erasure codes, and other
services that the Swift proxy server provides, so we are not without
precedent here.
4.8 Container Sync
------------------
Symlinks are synced like every other object. If the referenced object
in cluster A has a different container name than in cluster B, then
the symlink will point to the wrong place in one of the clusters.
Intra-container symlinks (those with only
``X-Object-Symlink-Target-Object``) will work correctly on both
clusters. Also, if containers are named identically on both clusters,
inter-container symlinks (those with
``X-Object-Symlink-Target-Object`` and
``X-Object-Symlink-Target-Container``) will work correctly too.
4.9 Bulk Uploads
----------------
Currently, bulk uploads ignore all non-file members in the uploaded
tarball. This could be expanded to also process symbolic-link members
(i.e. those for which ``tarinfo.issym() == True``) and create symlink
objects from them. This is not necessary for the initial
implementation of Swift symlinks, but it would be nice to have.
4.10 Swiftclient
----------------
python-swiftclient could download Swift symlinks as Unix symlinks if a
flag is given, or it could upload Unix symlinks as Swift symlinks in
some cases. This is not necessary for the initial implementation of
Swift symlinks, and is mainly mentioned here to show that
python-swiftclient was not forgotten.

::
This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode
==================
Test Specification
==================
This is a test specification. It should be removed after the first
real specification is merged.
Problem description
===================
A detailed description of the problem.
Proposed change
===============
Here is where you cover the change you propose to make in detail. How do you
propose to solve this problem?
If this is one part of a larger effort make it clear where this piece ends. In
other words, what's the scope of this effort?
Alternatives
------------
This is an optional section, where it does apply we'd just like a demonstration
that some thought has been put into why the proposed approach is the best one.
Implementation
==============
Assignee(s)
-----------
Who is leading the writing of the code? Or is this a blueprint where you're
throwing it out there to see who picks it up?
If more than one person is working on the implementation, please designate the
primary author and contact.
Primary assignee:
<launchpad-id or None>
Can optionally list additional ids if they intend on doing substantial
implementation work on this blueprint.
Work Items
----------
Work items or tasks -- break the feature up into the things that need to be
done to implement it. Those parts might end up being done by different people,
but we're mostly trying to understand the timeline for implementation.
Repositories
------------
Will any new git repositories need to be created?
Servers
-------
Will any new servers need to be created? What existing servers will
be affected?
DNS Entries
-----------
Will any other DNS entries need to be created or updated?
Dependencies
============
- Include specific references to specs and/or stories in infra, or in
other projects, that this one either depends on or is related to.
- Does this feature require any new library or program dependencies
not already in use?
- Does it require a new puppet module?

::
This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode
*************************
Automated Tiering Support
*************************
1. Problem Description
======================
Data hosted on long-term storage systems experience gradual changes in
access patterns as part of their information lifecycles. For example,
empirical studies by companies such as Facebook show that as image data
age beyond their creation times, they become more and more unlikely to be
accessed by users, with access rates dropping exponentially at times [1].
Long retention periods, as is the case with data stored on cold storage
systems like Swift, increase the possibility of such changes.
Tiering is an important feature provided by many traditional file & block
storage systems to deal with changes in data “temperature”. It enables
seamless movement of inactive data from high performance storage media to
low-cost, high-capacity storage media to meet customers' TCO (total cost of
ownership) requirements. As scale-out object storage systems like Swift are
starting to natively support multiple media types like SSD, HDD, tape and
different storage policies such as replication and erasure coding, it becomes
imperative to complement the wide range of available storage tiers (both
virtual and physical) with automated data tiering.
2. Tiering Use Cases in Swift
=============================
Swift users and operators can adapt to changes in access characteristics of
objects by transparently converting their storage policies to cater to the
goal of matching overall business needs ($/GB, performance, availability) with
where and how the objects are stored.
Some examples of how objects can be moved between Swift containers of different
storage policies as they age:
[SSD-based container] --> [HDD-based container]
[HDD-based container] --> [Tape-based container]
[Replication policy container] --> [Erasure coded policy container]
In some customer environments, a Swift container may not be the last storage
tier. Examples of archival-class stores lower in cost than Swift include
specialized tape-based systems [2], public cloud archival solutions such as
Amazon Glacier and Google Nearline storage. Analogous to this proposed feature
of tiering in Swift, Amazon S3 already has the in-built support to move
objects between S3 and Glacier based on user-defined rules. Red Hat Ceph has
recently added tiering capabilities as well.
3. Goals
========
The main goal of this document is to propose a tiering feature in Swift that
enables seamless movement of objects between containers belonging to different
storage policies. It is “seamless” because users will not experience any
disruption in namespace, access API, or availability of the objects subject to
tiering.
Through new Swift API enhancements, Swift users and operators alike will have
the ability to specify a tiering relationship between two containers and the
associated data movement rules.
The focus of this proposal is to identify, create and bring together the
necessary building blocks towards a baseline tiering implementation natively
within Swift. While this narrow scope is intentional, the expectation is that
the baseline tiering implementation will lay the foundation and not preclude
more advanced tiering features in future.
4. Feature Dependencies
=======================
The following in-progress Swift features (aka specs) have been identified as
core dependencies for this tiering proposal.
1. Swift Symbolic Links [3]
2. Changing Storage Policies [4]
A few other specs are classified as nice-to-have dependencies, meaning that
if they evolve into full implementations we will be able to demonstrate the
tiering feature with advanced use cases and capabilities. However, they are
not considered mandatory requirements for the first version of tiering.
3. Metadata storage/search [5]
4. Tape support in Swift [6]
5. Implementation
=================
The proposed tiering implementation depends on several building blocks, some
of which are unique to tiering, like the requisite API changes. They will be
described in their entirety. Others like symlinks are independent features and
have uses beyond tiering. Instead of re-inventing the wheel, the tiering
implementation aims to leverage specific constructs that will be available
through these in-progress features.
5.1 Overview
------------
For a quick overview of the tiering implementation, please refer to the Figure
(images/tiering_overview.png). It highlights the flow of actions taking place
within the proposed tiering engine.
1. Swift client creates a tiering relationship between two Swift containers by
marking the source container with appropriate metadata.
2. A background process named tiering-coordinator examines the source container
and iterates through its objects.
3. Tiering-coordinator identifies candidate objects for movement and de-stages
each object to the target container by issuing a copy request to an object server.
4. After an object is copied, tiering-coordinator replaces it with a symlink in
the source container pointing to the corresponding object in the target container.
5.2 API Changes
---------------
Swift clients will be able to create a tiering relationship between two
containers, i.e., source and target containers, by adding the following
metadata to the source container.
X-Container-Tiering-Target: <target_container_name>
X-Container-Tiering-Age: <threshold_object_age>
The metadata values can be set during the creation of the source container
(PUT) operation or they can be set later as part of a container metadata
update (POST) operation. Object age refers to the time elapsed since the
object's creation time (creation time is stored with the object as the
X-Timestamp header).
The user semantics of setting the above container metadata are as follows.
When objects in the source container become older than the specified threshold
time, they become candidates for being de-staged to the target container. There
are no guarantees on when exactly they will be moved or the precise location of
the objects at any given time. Swift will operate on them asynchronously and
relocate objects based on user-specified tiering rules. Once the tiering
metadata is set on the source container, the user can expect levels of
performance, reliability, etc. for its objects commensurate with the storage
policy of either the source or target container.
One can override the tiering metadata for individual objects in the source
container by setting the following per-object metadata:
X-Object-Tiering-Target: <target_container_name>
X-Object-Tiering-Age: <object_age_in_minutes>
Presence of tiering metadata on an object will imply that it will take
precedence over the tiering metadata set on the hosting container. However,
if a container is not tagged with any tiering metadata, the objects inside it
will not be considered for tiering regardless of whether they are tagged with
any tiering related metadata or not. Also, if the tiering age threshold on the
object metadata is lower than the value set on the container, it will not take
effect until the container age criterion is met.
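A minimal sketch of these precedence rules, assuming the proposed
X-Container-Tiering-* and X-Object-Tiering-* headers and an age threshold
expressed in seconds (the spec leaves the exact unit open; the per-object
example above uses minutes):

```python
import time


def tiering_candidate(container_meta, object_meta, object_ts, now=None):
    """Return the effective tiering target for an object, or None if the
    object is not (yet) a candidate. Header names are the ones proposed in
    this spec; this is an illustrative sketch, not the implementation."""
    now = now if now is not None else time.time()
    # Without tiering metadata on the container, per-object metadata is
    # ignored entirely.
    c_target = container_meta.get('X-Container-Tiering-Target')
    c_age = container_meta.get('X-Container-Tiering-Age')
    if c_target is None or c_age is None:
        return None
    age = now - object_ts
    # The container's age criterion must be met first; a lower per-object
    # threshold does not take effect before that.
    if age < float(c_age):
        return None
    # Per-object metadata, when present, takes precedence for the target
    # and may raise the effective age threshold.
    target = object_meta.get('X-Object-Tiering-Target', c_target)
    o_age = object_meta.get('X-Object-Tiering-Age')
    if o_age is not None and age < float(o_age):
        return None
    return target
```

In effect the threshold becomes the larger of the container and object
values, with the object-level target winning when both are set.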
An important invariant preserved by the tiering feature is the namespace of
objects. As will be explained in later sections, after objects are moved they
will be replaced immediately by symlinks that will allow users to continue
foreground operations on objects as if no migrations have taken place. Please
refer to section 7 on open questions for further commentary on the API topic.
To summarize, here are the steps that a Swift user must perform in order to
initiate tiering between objects from a source container (S) to a target
container (T) over time.
1. Create containers S and T with desired storage policies, say replication
and erasure coding respectively
2. Set the tiering-related metadata (X-Container-Tiering-*) on container S
as described earlier in this section.
3. Deposit objects into container S.
4. If needed, override the default container settings for individual objects
inside container S by setting object metadata (X-Object-Tiering-*).
It will also be possible to create cascading tiering relationships between
more than two containers. For example, a sequence of tiering relationships
between containers C1 -> C2 -> C3 can be established by setting appropriate
tiering metadata on C1 and C2. When an object is old enough to be moved from
C1, it will be deposited in C2. The timer will then start on the moved object
in C2 and depending on the age settings on C2, the object will eventually be
migrated to C3.
5.3 Tiering Coordinator Process
-------------------------------
The tiering-coordinator is a background process similar to container-sync,
container-reconciler and other container-* processes running on each container
server. We can potentially re-use one of the existing container processes,
specifically either container-sync or container-reconciler to perform the job of
tiering-coordinator, but for the purposes of this discussion it will be assumed
that it is a separate process.
The key actions performed by tiering-coordinator are
(a) Walk through containers marked with tiering metadata
(b) Identify candidate objects for tiering within those containers
(c) Initiate copy requests on candidate objects
(d) Replace source objects with corresponding symlinks
We will discuss (a) and (b) in this section and cover (c) and (d) in subsequent
sections. Note that in the first version of tiering, only one metric
<object age> will be used to determine the eligibility of an object for
migration.
The tiering-coordinator performs its operations in a series of rounds. In each
round, it iterates through containers whose SQLite DBs it has direct access to
on the container server it is running on. It checks if the container has the
right X-Container-Tiering-* metadata. If present, it starts the scanning process
to identify candidate objects. The scanning process leverages a convenient (but
not necessary) property of the container DB that objects are listed in the
chronological order of their creation times. That is, the first index in the
container DB points to the object with oldest creation time, followed by next
younger object and so on. As such, the scanning process described below is
optimized for the object age criterion chosen for tiering v1 implementation.
For extending to other tiering metrics, we refer the reader to section 6.1 for
discussion.
Each container DB will have two persistent markers to track the progress of
tiering: tiering_sync_start and tiering_sync_end. The marker tiering_sync_start
refers to the starting index in the container DB up to which objects have already
been processed. The marker tiering_sync_end refers to the index beyond which
objects have not yet been considered for tiering. All the objects that fall
between the two markers are the ones for which tiering is currently in progress.
Note that the presence of persistent markers in the container DB helps with
quickly resuming from previous work done in the event of container server
crash/reboot.
When a container is selected for tiering for the first time, both the markers
are initialized to -1. If the first object is old enough to meet the
X-Container-Tiering-Age criterion, tiering_sync_start is set to 0. Then the
second marker tiering_sync_end is advanced to the lesser of two values: (i)
tiering_sync_start + tier_max_objects_per_round (the latter
will be a configurable value in /etc/swift/container.conf) or (ii) largest
index in the container DB whose corresponding object meets the tiering age
criterion.
The above marker settings will ensure two invariants. First, all objects
between (and including) tiering_sync_start and tiering_sync_end are candidates
for moving to the target container. Second, it will guarantee that the number
of objects processed on the container in a single round is bound by the
configuration parameter (tier_max_objects_per_round, say = 200). This will
ensure that the coordinator process will round robin effectively amongst all
containers on the server per round without spending undue amount of time on
only a few.
After the markers are fixed, tiering-coordinator will issue a copy request
for each object within the range. When the copy requests are completed, it
updates tiering_sync_start = tiering_sync_end and moves on to the next
container. When tiering-coordinator re-visits the same container after
completing the current round, it restarts the scanning routine described
above from tiering_sync_start = tiering_sync_end (except they are not both
-1 this time).
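The marker-advance step can be sketched as follows. The container listing is
modeled here as an in-memory list of (name, timestamp) rows in creation
order, and tier_max_objects_per_round as a plain argument; both are
simplifications of this sketch, not the actual container DB interface:

```python
def advance_markers(rows, start, age_threshold, now, max_per_round=200):
    """One scanning round for a single container.

    rows: container DB listing as (name, x_timestamp) tuples, oldest first.
    start: current tiering_sync_start index (-1 before the first round).
    Returns the new (tiering_sync_start, tiering_sync_end) pair, or the
    unchanged markers when no object meets the age criterion."""
    if start < 0:
        if not rows or now - rows[0][1] < age_threshold:
            return start, start          # first object not old enough yet
        start = 0
    # Find the largest index whose object meets the age criterion; rows are
    # in creation order, so everything past the first miss is younger.
    oldest_ok = -1
    for i, (_name, ts) in enumerate(rows):
        if now - ts >= age_threshold:
            oldest_ok = i
        else:
            break
    # Bound the work done on this container per round.
    end = min(start + max_per_round, oldest_ok)
    if end < start:
        return start, start              # nothing new to process
    return start, end
```

Every index in the inclusive range (start, end) is then a candidate for a
copy request, after which tiering_sync_start is set to tiering_sync_end.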
In a typical Swift cluster, each container DB is replicated three times and
resides on multiple container servers. Therefore, without proper
synchronization, tiering-coordinator processes can end up conflicting with
each other by processing the same container and same objects within. This
can potentially lead to race conditions with non-deterministic behavior. We
can overcome this issue by adopting the approach of divide-and-conquer
employed by container-sync process. The range of object indices between
(tiering_sync_start, tiering_sync_end) can be initially split up into as
many disjoint regions as the number of tiering-coordinator processes
operating on the same container. As they work through the object indices,
each process might additionally complete others' portions depending on the
collective progress. For a detailed description of how container-sync
processes implicitly communicate and make group progress, please refer
to [7].
5.4 Object Copy Mechanism
-------------------------
For each candidate object that the tiering-coordinator deems eligible to move to
the target container, it issues an object copy request using an API call
supported by the object servers. The API call will map to a method used by
object-transferrer daemons running on the object servers. The
tiering-coordinator can select any of the object servers (by looking up the ring
datastructure corresponding to the object in source container policy) as a
destination for the request.
The object-transferrer daemon is supposed to be optimized for converting an
object from one storage policy to another. As per the Changing policies spec,
the object-transferrer daemon will be equipped with the right techniques to move
objects between Replication -> EC, EC -> EC, etc. Alternatively, in the absence
of object-transferrer, the tiering coordinator can simply make use of the
server-side COPY API that vanilla Swift exposes to regular clients. It can
send the COPY request to a swift proxy server to clone the source object into
the target container. The proxy server will perform the copy by first reading in
(GET request) the object from any of the source object servers and creating a
copy (PUT request) of the object in the target object servers. While this will
work correctly for the purposes of the tiering coordinator, making use of the
object-transferrer interface is likely to be a better option. Leveraging the
specialized code in object-transferrer through a well-defined interface for
copying an object between two different storage policy containers will make the
overall tiering process efficient.
Here is an example interface represented by a function call in the
object-transferrer code:
def copy_object(source_obj_path, target_obj_path)
The above method can be a wrapper over similar functionality used by the
object-transferrer daemon. The tiering-coordinator will use this interface to
call the function through an HTTP call:
copy_object(/A/S/O, /A/T/O)
where S is the source container and T is the target container. Note that the
object name in the target container will be the same as in the source container.
Upon receiving the copy request, the object server will first check if the
source path is a symlink object. If it is a symlink, it will respond with an
error to the tiering-coordinator to indicate that a symlink already exists.
This behavior will ensure idempotence and guard against situations where
tiering-coordinator crashes and retries a previously completed object copy
request. Also, it avoids tiering for sparse objects such as symlinks created
by users. Secondly, the object server will check if the source object has
tiering metadata in the form of X-Object-Tiering-* that overrides the default
tiering settings on the source container. It may or may not perform the object
copy depending on the result.
5.5 Symlink Creation
--------------------
After an object is successfully copied to the destination container, the
tiering-coordinator will issue a symlink create request to proxy server to
replace the source object by a reference to the destination object. Waiting
until the object copy is completed before replacing it by a symlink ensures
safety in case of failures. The system could end up with an extra target
object without a symlink pointing to it, but not the converse which
constitutes data loss. Note that the symlink feature is currently
work-in-progress and will also be available as an external API to swift clients.
When the symlink is created by the tiering-coordinator, it will need to ensure
that the original object's X-Timestamp value is preserved on the symlink
object. Therefore, it is proposed that in the symlink creation request, the
original time field can be provided (tiering-coordinator can quickly read the
original values from container DB entry) as object user metadata, which is
translated internally to a special sysmeta field by the symlink middleware.
On subsequent user requests, the sysmeta field storing the correct creation
timestamp will be sent to the user.
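A sketch of the headers such an internal symlink PUT might carry. The
X-Symlink-Target headers come from the in-progress symlink spec; the sysmeta
name used here to carry the original creation time is invented for
illustration only:

```python
def symlink_put_headers(target_account, target_container, obj, orig_timestamp):
    """Build headers for the internal symlink PUT issued by the
    tiering-coordinator. The sysmeta header name is a hypothetical one."""
    return {
        # Reference to the de-staged object, per the symlink spec.
        'X-Symlink-Target': '%s/%s' % (target_container, obj),
        'X-Symlink-Target-Account': target_account,
        # Preserved so users keep seeing the original creation time; the
        # symlink middleware would translate this internally.
        'X-Object-Sysmeta-Tiering-Orig-Timestamp': orig_timestamp,
    }
```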
With the symlink successfully created, Swift users can continue to issue object
requests like GET, PUT to the original namespace /Account/Container/Object. The
Symlink middleware will ensure that the swift users do not notice the presence
of a symlink object unless a query parameter ?symlink=true [3] is explicitly
provided with the object request.
Users can also continue to read and update object metadata as before. It is not
entirely clear at the time of this writing if the symlink object will store a
copy of user metadata in its own extended attributes or if it will fetch the
metadata from the referenced object for every HEAD/GET on the object. We will
defer to whichever implementation that the symlink feature chooses to provide.
An interesting race condition is possible due to the time window between object
copy request and symlink creation. If there is an interim PUT request issued by
a swift user between the two, it will be overwritten by the internal symlink
created by the tiering-coordinator. This is an incorrect behavior that we need
to protect against. We can use the same technique [8] (with help of a second
vector timestamp) that container-reconciler uses to resolve a similar race
condition. The tiering-coordinator, at the time of symlink creation, can detect
the race condition and undo the COPY request. It will have to delete the object
that was created in the destination container. Though this is wasted work in
the face of such race conditions, we expect them to be rare scenarios. If the
user configures tiering rules sensibly, there ought to be little to no
foreground traffic for the object that is being tiered.
6. Future Work
===============
6.1 Other Tiering Criteria
--------------------------
The first version of tiering implementation will be heavily tailored (especially
the scanning mechanism of tiering-coordinator) to the object age criterion. The
convenient property of container DBs that store objects in the same order as
they are created/overwritten lends to very efficient linear scanning for
candidate objects.
In the future, we should be able to support advanced criteria such as read
frequency counts, object size, metadata-based selection, etc. For example,
consider the following hypothetical criterion:
"Tier objects from container S to container T if older than 1 month AND size >
1GB AND tagged with metadata surveillance-video"
When the metadata search feature [5] is available in Swift, tiering-coordinator
should be able to run queries to quickly retrieve the set of object names that
match ad-hoc criteria on both user and system metadata. As the metadata search
feature evolves, we should be able to leverage it to add custom metadata such
as read counts, etc. for our purposes.
6.2 Integration with External Storage Tiers
-------------------------------------------
The first implementation of tiering will only support object movement between
Swift containers. In order to establish a tiering relationship between a swift
container and an external storage backend, the backend must be mounted in Swift
as a native container through the DiskFile API or other integration mechanisms.
For instance, a target container fully hosted on GlusterFS or Seagate Kinetic
drives can be created through Swift-on-file or Kinetic DiskFile implementations
respectively.
The Swift community believes that a similar integration approach is necessary
to support external storage systems as tiering targets. There is already work
underway to integrate tape-based systems in Swift. In the same vein, future
work is needed to integrate external systems like Amazon Glacier or vendor
archival products via DiskFile drivers or other means.
7. Open Issues
==============
This section is structured as a series of questions and possible answers. With
more feedback from the swift community, the open issues will be resolved and
merged into the main document.
Q1: Can the target container exist on a different account than the source
container?
Ans: The proposed API assumes that the target container is always on the same
account as the source container. If this restriction is lifted, the proposed
API needs to be modified appropriately.
Q2: When the client sets the tiering metadata on the source container, should
the target container exist at that time? What if the user has no permissions on
the target container? When is all the error checking done?
Ans: The error checking can be deferred to the tiering-coordinator process. The
background process, upon detecting that the target container is unavailable can
skip performing any tiering activity on the source container and move on to the
next container. However, it might be better to detect errors in the client path
and report early. If the latter approach is chosen, middleware functionality is
needed to sanity check tiering metadata set on containers.
Q3: How is the target container presented to the client? Would it be just like
any other container with read/write permissions?
Ans: The target container will be just like any other container. The client is
responsible for manipulating the contents in the target container correctly. In
particular, it should be aware that there might be symlinks in source container
pointing to target objects. Deletions or overwrites of objects directly using
the target container namespace could render some symlinks useless or obsolete.
Q4: What is the behavior when conflicting tiering metadata are set over a
period of time? For example, if the tiering age threshold is increased on a
container with a POST metadata operation, will previously de-staged objects
be brought back to the source container to match the new tiering rule?
Ans: Perhaps not. The new tiering metadata should probably only be applied to
objects that have not yet been processed by tiering-coordinator. Previous
actions performed by tiering-coordinator based on older metadata need not be
reversed.
Q5: When a user issues a PUT operation to an object that has been de-staged to
the target container earlier, what is the behavior?
Ans: The default symlink behavior should apply but it's not clear what it will
be. Will an overwrite PUT cause the symlink middleware to delete both the
symlink and the object being pointed to?
Q6: When a user issues a GET operation to an object that has been de-staged to
the target container earlier, will it be promoted back to source container?
Ans: The proposed implementation does not promote objects back to an upper tier
seamless to the user. If needed, such a behavior can be easily added with help
of a tiering middleware in the proxy server.
Q7: There is a mention of the ability to set cascading tiering relationships
between multiple containers, C1 -> C2 -> C3. What if there is a cycle in this
relationship graph?
Ans: A cycle should be prevented, else we can run into at least one complicated
situation where a symlink might be pointing to an object on the same container
with the same name, thereby overwriting the symlink! It is possible to detect
cycles at the time of tiering metadata creation in the client path with a
tiering-specific middleware that is entrusted with the cycle detection by
iterating through existing tiering relationships.
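The cycle check described in this answer can be sketched as a simple walk
over the relationship graph; representing the relationships as a mapping of
container to tiering target is an assumption of this sketch:

```python
def creates_cycle(relationships, source, target):
    """Return True if adding source -> target would close a cycle.

    relationships maps each container to its current tiering target."""
    seen = set()
    node = target
    while node is not None:
        if node == source:
            return True                  # walking targets leads back to source
        if node in seen:
            return True                  # pre-existing cycle; refuse as well
        seen.add(node)
        node = relationships.get(node)
    return False
```

The middleware would run this check before accepting new
X-Container-Tiering-Target metadata.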
Q8: Are there any unexpected interactions of tiering with existing or new
features like SLO/DLO, encryption, container sharding, etc.?
Ans: SLO and DLO segments should continue to work as expected. If an object
server receives an object copy request for a SLO manifest object from a
tiering-coordinator, it will iteratively perform the copy for each constituent
object. Each constituent object will be replaced by a symlink. Encryption
should also work correctly as it is almost entirely orthogonal to the tiering
feature. Each object is treated as an opaque set of bytes by the tiering engine
and it does not pay any heed to whether the object is cipher text or not.
Dealing with container sharding might be tricky. Tiering-coordinator expects
to linearly walk through the indices of a container DB. If the container DB
is fragmented and stored in many different container servers, the scanning
process can get complicated. Any ideas there?
8. References
=============
1. http://www.enterprisetech.com/2013/10/25/facebook-loads-innovative-cold-storage-datacenter/
2. http://www-03.ibm.com/systems/storage/tape/
3. Symlinks in Swift. https://review.openstack.org/#/c/173609/
4. Changing storage policies in Swift. https://review.openstack.org/#/c/168761/
5. Add metadata search in Swift. https://review.openstack.org/#/c/180918/
6. Tape support in Swift. https://etherpad.openstack.org/p/liberty-swift-tape-storage
7. http://docs.openstack.org/developer/swift/overview_container_sync.html
8. Container reconciler section at http://docs.openstack.org/developer/swift/overview_policies.html
@ -1,270 +0,0 @@
::
This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode
..
This template should be in ReSTructured text. Please do not delete
any of the sections in this template. If you have nothing to say
for a whole section, just write: "None". For help with syntax, see
http://sphinx-doc.org/rest.html To test out your formatting, see
http://www.tele3.cz/jbar/rest/rest.html
=========================
Updateable Object Sysmeta
=========================
The original system metadata patch ( https://review.openstack.org/#/c/51228/ )
supported only account and container system metadata.
There are now patches in review that store middleware-generated metadata
with objects, e.g.:
* on demand migration https://review.openstack.org/#/c/64430/
* server side encryption https://review.openstack.org/#/c/76578/1
Object system metadata should not be stored in the x-object-meta- user
metadata namespace because (a) there is a potential name conflict with
arbitrarily named user metadata and (b) system metadata in the x-object-meta-
namespace will be lost if a user sends a POST request to the object.
A patch is under review ( https://review.openstack.org/#/c/79991/ ) that will
persist system metadata that is included with an object PUT request,
and ignore system metadata sent with POSTs.
The goal of this work is to enable object system metadata to be persisted
AND updated. Unlike user metadata, it should be possible to update
individual items of system metadata independently when making a POST request
to an object server.
This work applies to fast-POST operation, not POST-as-copy operation.
Problem Description
===================
Item-by-item updates to metadata can be achieved by simple changes to the
metadata read-modify-write cycle during a POST to the object server: read
system metadata from existing data or meta file, merge new items,
write to a new meta file. However, concurrent POSTs to a single server or
inconsistent results between multiple servers can lead to multiple meta
files containing divergent sets of system metadata. These must be preserved
and eventually merged to achieve eventual consistency.
Proposed Change
===============
The proposed new behavior is to preserve multiple meta files in the obj_dir
until their system metadata is known to have been read and merged into a
newer meta file.
When constructing a diskfile object, all existing meta files that are newer
that the data file (usually just one) should be read for potential system
metadata contributions. To enable a per-item most-recent-wins semantic when
merging contributions from multiple meta files, system metadata should be
stored in meta files as `key: (value, timestamp)` pairs. This is not
necessary when system metadata is stored in a data file because the
timestamp of those items is known to be that of the data file.
When writing the diskfile during a POST, the merged set of system metadata
should be written to the new meta file, after which the older meta files can
be deleted.
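The per-item most-recent-wins merge can be sketched directly from the
`key: (value, timestamp)` representation described above:

```python
def merge_sysmeta(data_file_meta, data_timestamp, meta_files):
    """Merge system metadata contributions, most recent timestamp winning.

    data_file_meta maps key -> value; its items all share the data file's
    timestamp. Each entry of meta_files maps key -> (value, timestamp)."""
    merged = {k: (v, data_timestamp) for k, v in data_file_meta.items()}
    for mf in meta_files:
        for key, (value, ts) in mf.items():
            # Keep whichever value carries the newer timestamp.
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    return merged
```

Running this over the two meta files in the example below yields exactly the
composed HEAD response shown there.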
This requires a change to the diskfile cleanup code (`hash_cleanup_listdir`).
After creating a new meta file, instead of deleting all older meta files,
only those that were either older than the data file or read during
construction of the new meta file are deleted.
In most cases the result will be same, but if a second concurrent request
has written a meta file that was not read by the first request handler then
this meta file will be left in place.
Similarly, a change is required in the async cleanup process (called by the
replicator daemon). The cleanup process should merge any existing meta files
into the most recent file before deleting older files. To reduce workload,
this merge process could be conditional upon a threshold number of meta
files being found.
Replication considerations
--------------------------
As a result of failures, object servers may have different existing meta
files for an object when a POST is handled and a new (merged) metadata set
is written to a new meta file. Consequently, object servers may end up with
identically timestamped meta files having different system metadata content.
rsync:
To differentiate between these meta files it is proposed to include a hash
of the metadata content in the name of the meta file. As a result,
meta files with differing content will be replicated between object servers
and their contents merged to achieve eventual consistency.
The timestamp part of the meta filename is still required in order to (a)
allow meta files older than a data or tombstone file to be deleted without
being read and (b) to continue to record the modification time of user
metadata.
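A sketch of the proposed `timestamp.hash.meta` naming. The serialization
format and the choice of MD5 here are assumptions of this sketch, not part
of the spec:

```python
import hashlib
import json


def meta_filename(timestamp, metadata):
    """Name a meta file by timestamp plus a content hash, so that
    identically timestamped files with different merged contents replicate
    as distinct files and can later be merged."""
    serialized = json.dumps(metadata, sort_keys=True).encode('utf-8')
    digest = hashlib.md5(serialized).hexdigest()
    return '%s.%s.meta' % (timestamp, digest)
```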
ssync - TBD
Deleting system metadata
------------------------
An item of system metadata with key `x-object-sysmeta-x` should be deleted
when a header `x-object-sysmeta-x:""` is included with a POST request. This
can be achieved by persisting the system metadata item in meta files with an
empty value, i.e. `key : ("", timestamp)`, to indicate to any future metadata
merges that the item has been deleted. This guards against inclusion of
obsolete values from older meta files at the expense of storing the empty
value. The empty-valued system metadata may be finally removed during a
subsequent merge when it is observed that some expiry time has passed since
its timestamp (i.e. any older value that the empty value is overriding would
have been replicated by this time, so it is safe to delete the empty value).
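The expiry rule for empty-valued deletion markers can be sketched as:

```python
def prune_deletion_markers(merged, now, expiry):
    """Drop empty-valued (deleted) sysmeta items once `expiry` seconds have
    passed since their timestamp. `merged` maps key -> (value, timestamp);
    timestamps are plain seconds in this sketch."""
    return {k: (v, ts) for k, (v, ts) in merged.items()
            if v != '' or now - ts < expiry}
```

This would run as part of a later merge, once any older value the empty
marker was overriding has had time to replicate away.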
Example
-------
Consider the following scenario. Initially the object dir on each object
server contains just the original data file::
obj_dir:
t1.data:
x-object-sysmeta-p: ('p1', t0)
Two concurrent POSTs update the object on servers A and B,
with timestamps t2 and t3, but fail on server C. One POST updates
`x-object-sysmeta-p` and adds `x-object-sysmeta-y`. The other POST adds
`x-object-sysmeta-z`. These POSTs result in two meta files being added to the
object directory on A and B::
obj_dir:
t1.data:
x-object-sysmeta-p: ('p1', t0)
t2.h2.meta:
x-object-sysmeta-p: ('p2', t2)
x-object-sysmeta-x: ('x1', t2)
x-object-sysmeta-y: ('y1', t2)
t3.h3.meta:
x-object-sysmeta-p: ('p1', t0)
x-object-sysmeta-x: ('x2', t3)
x-object-sysmeta-z: ('z1', t3)
(the `hN` component of each filename represents a hash of that meta file's
metadata)
A response to a subsequent HEAD request would contain the composition of the
two meta files' system metadata items::
x-object-sysmeta-p: 'p2'
x-object-sysmeta-x: 'x2'
x-object-sysmeta-y: 'y1'
x-object-sysmeta-z: 'z1'
A further POST request received at t4 deletes `x-object-sysmeta-p`. This
causes the two meta files to be read, their contents merged, and a new meta
file written. This POST succeeds on all servers,
so on servers A and B we have::
obj_dir:
t1.data :
x-object-sysmeta-p: ('p1', t0)
t4.h4a.meta:
x-object-sysmeta-p: ('', t4)
x-object-sysmeta-x: ('x3', t3)
x-object-sysmeta-z: ('z1', t3)
x-object-sysmeta-y: ('y1', t2)
whereas on server C we have::
obj_dir:
t1.data :
x-object-sysmeta-p: ('p1', t0)
t4.h4b.meta:
x-object-sysmeta-p: ('', t4)
Eventually the meta files will be replicated between servers and merged,
leaving all servers with::
obj_dir:
t1.data :
x-object-sysmeta-p: ('p1', t0)
t4.h4a.meta:
x-object-sysmeta-p: ('', t4)
x-object-sysmeta-x: ('x3', t3)
x-object-sysmeta-z: ('z1', t3)
x-object-sysmeta-y: ('y1', t2)
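The per-key, last-writer-wins merge used throughout this example might be
sketched as follows. This illustrates the assumed merge semantics, not the
actual diskfile code.

```python
def merge_meta(*meta_files):
    """Merge metadata mappings of key -> (value, timestamp).

    For each key, the value with the newest timestamp wins, regardless
    of which meta file it came from, so the merge result is independent
    of the order in which meta files are read.
    """
    merged = {}
    for meta_file in meta_files:
        for key, (value, ts) in meta_file.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    return merged
```

Applied to the two meta files in the example above, this yields exactly the
composition returned by the HEAD request: `p2` (t2 beats the stale `p1`
copied at t0), `x2` (t3 beats t2), plus `y1` and `z1`.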
Alternatives
------------
One alternative approach would be to preserve all meta files that are newer
than a data or tombstone file and never merge their contents. This removes
the need to include a hash in the meta file name, but has the obvious
disadvantage of accumulating an increasing number of files, each of which
needs to be read when constructing a diskfile.
Another alternative would be to store system metadata in a separate `sysmeta`
file. It might then be possible to drop the timestamp from the filename (if
the `timestamp.hash` format is deemed too long).
Implementation
==============
Assignee(s)
-----------
Primary assignee:
Alistair Coles (acoles)
Work Items
----------
TBD
Repositories
------------
None
Servers
-------
None
DNS Entries
-----------
None
Documentation
-------------
No change to external API docs. Developer docs would be updated to make
developers aware of the feature.
Security
--------
None
Testing
-------
Additional unit tests will be required for diskfile.py and the object server.
Probe tests will be useful for verifying replication behavior.
Dependencies
============
Patch for object system metadata on PUT only:
https://review.openstack.org/#/c/79991/
Spec for updating containers on fast-POST:
https://review.openstack.org/#/c/102592/
There is a mutual dependency between this spec and the spec to update
containers on fast-POST: the latter requires content-type to be treated as
an item of mutable system metadata, which this spec aims to enable. This
spec assumes that fast-POST becomes usable, which requires consistent
container updates to be enabled.
View File
@ -1,114 +0,0 @@
::
This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode
..
This template should be in ReSTructured text. Please do not delete
any of the sections in this template. If you have nothing to say
for a whole section, just write: "None". For help with syntax, see
http://sphinx-doc.org/rest.html To test out your formatting, see
http://www.tele3.cz/jbar/rest/rest.html
===============================
The Title of Your Specification
===============================
Include the URL of your blueprint:
https://blueprints.launchpad.net/swift/...
Introduction paragraph -- why are we doing anything?
Problem Description
===================
A detailed description of the problem.
Proposed Change
===============
Here is where you cover the change you propose to make in detail. How do you
propose to solve this problem?
If this is one part of a larger effort make it clear where this piece ends. In
other words, what's the scope of this effort?
Alternatives
------------
This is an optional section, where it does apply we'd just like a demonstration
that some thought has been put into why the proposed approach is the best one.
Implementation
==============
Assignee(s)
-----------
Who is leading the writing of the code? Or is this a blueprint where you're
throwing it out there to see who picks it up?
If more than one person is working on the implementation, please designate the
primary author and contact.
Primary assignee:
<launchpad-id or None>
Can optionally list additional ids if they intend on doing substantial
implementation work on this blueprint.
Work Items
----------
Work items or tasks -- break the feature up into the things that need to be
done to implement it. Those parts might end up being done by different people,
but we're mostly trying to understand the timeline for implementation.
Repositories
------------
Will any new git repositories need to be created?
Servers
-------
Will any new servers need to be created? What existing servers will
be affected?
DNS Entries
-----------
Will any other DNS entries need to be created or updated?
Documentation
-------------
Will this require a documentation change? If so, which documents?
Will it impact developer workflow? Will additional communication need
to be made?
Security
--------
Does this introduce any additional security risks, or are there
security-related considerations which should be discussed?
Testing
-------
What tests will be available or need to be constructed in order to
validate this? Unit/functional tests, development
environments/servers, etc.
Dependencies
============
- Include specific references to specs and/or stories in swift, or in
other projects, that this one either depends on or is related to.
- Does this feature require any new library or program dependencies
not already in use?
- Does it require a new puppet module?
View File
26
tox.ini
View File
@ -1,26 +0,0 @@
[tox]
minversion = 1.6
envlist = docs
skipsdist = True
[testenv]
usedevelop = True
install_command = pip install -U {opts} {packages}
setenv =
VIRTUAL_ENV={envdir}
deps = -r{toxinidir}/requirements.txt
-r{toxinidir}/test-requirements.txt
passenv = *_proxy *_PROXY
[testenv:venv]
commands = {posargs}
[testenv:docs]
commands = python setup.py build_sphinx
[testenv:spelling]
deps =
-r{toxinidir}/requirements.txt
sphinxcontrib-spelling
PyEnchant
commands = sphinx-build -b spelling doc/source doc/build/spelling