Merge "Chronicles of a DLM"
This commit is contained in:
commit
ece411aa6f
|
@ -0,0 +1,403 @@
|
||||||
|
==========================================
|
||||||
|
Chronicles of a distributed lock manager
|
||||||
|
==========================================
|
||||||
|
|
||||||
|
No blueprint, this is intended as a reference/consensus document.
|
||||||
|
|
||||||
|
The various OpenStack projects have an ongoing requirement to perform
|
||||||
|
some set of actions in an atomic manner performed by some distributed set of
|
||||||
|
applications on some set of distributed resources **without** having those
|
||||||
|
resources end up in some corrupted state due those actions being performed on
|
||||||
|
them without the traditional concept of `locking`_.
|
||||||
|
|
||||||
|
A `DLM`_ is one such concept/solution that can help (but not entirely
|
||||||
|
solve) these types of common resource manipulation patterns in distributed
|
||||||
|
systems. This specification will be an attempt at defining the problem
|
||||||
|
space, understanding what each project *currently* has done in regards of
|
||||||
|
creating its own `DLM`_-like entity and how we can make the situation better
|
||||||
|
by coming to consensus on a common solution that we can benefit from to
|
||||||
|
make everyone's lives (developers, operators and users of OpenStack
|
||||||
|
projects) that much better. Such a consensus being built will also
|
||||||
|
influence the future functionality and capabilities of OpenStack at large
|
||||||
|
so we need to be **especially** careful, thoughtful, and explicit here.
|
||||||
|
|
||||||
|
.. _DLM: https://en.wikipedia.org/wiki/Distributed_lock_manager
|
||||||
|
.. _locking: https://en.wikipedia.org/wiki/Lock_%28computer_science%29
|
||||||
|
|
||||||
|
Problem description
|
||||||
|
===================
|
||||||
|
|
||||||
|
Building distributed systems is **hard**. It is especially hard when the
|
||||||
|
distributed system (and the applications ``[X, Y, Z...]`` that compose the
|
||||||
|
parts of that system) manipulate mutable resources without the ability to do
|
||||||
|
so in a conflict-free, highly available, and
|
||||||
|
scalable manner (for example, application ``X`` on machine ``1`` resizes
|
||||||
|
volume ``A``, while application ``Y`` on machine ``2`` is writing files to
|
||||||
|
volume ``A``). Typically in local applications (running on a single
|
||||||
|
machine) these types of conflicts are avoided by using primitives provided
|
||||||
|
by the operating system (`pthreads`_ for example, or filesystem locks, or
|
||||||
|
other similar `CAS`_ like operations provided by the `processor instruction`_
|
||||||
|
set). In distributed systems these types of solutions do **not** work, so
|
||||||
|
alternatives have to either be invented or provided by some
|
||||||
|
other service (for example one of the many academia has created, such
|
||||||
|
as `raft`_ and/or other `paxos`_ variants, or services created
|
||||||
|
from these papers/concepts such as `zookeeper`_ or `chubby`_ or one of the
|
||||||
|
many `raft implementations`_ or the redis `redlock`_ algorithm). Sadly in
|
||||||
|
OpenStack this has meant that there are now multiple implementations/inventions
|
||||||
|
of such concepts (most using some variation of database locking), using
|
||||||
|
different techniques to achieve the defined goal (conflict-free, highly
|
||||||
|
available, and scalable manipulation of resources). To make things worse
|
||||||
|
some projects still desire to have this concept and have not reached the
|
||||||
|
point where it is needed (or they have reached this point but have been
|
||||||
|
unable to achieve consensus around an implementation and/or
|
||||||
|
direction). Overall this diversity, while nice for inventors and people
|
||||||
|
that like to explore these concepts does **not** appear to be the best
|
||||||
|
solution we can provide to operators, developers inside the
|
||||||
|
community, deployers and other users of the now (and every expanding) diverse
|
||||||
|
set of `OpenStack projects`_.
|
||||||
|
|
||||||
|
.. _redlock: http://redis.io/topics/distlock
|
||||||
|
.. _pthreads: http://man7.org/linux/man-pages/man7/pthreads.7.html
|
||||||
|
.. _CAS: https://en.wikipedia.org/wiki/Compare-and-swap
|
||||||
|
.. _processor instruction: http://www.felixcloutier.com/x86/CMPXCHG.html
|
||||||
|
.. _paxos: https://en.wikipedia.org/wiki/Paxos_%28computer_science%29
|
||||||
|
.. _raft: http://raftconsensus.github.io/
|
||||||
|
.. _zookeeper: https://en.wikipedia.org/wiki/Apache_ZooKeeper
|
||||||
|
.. _chubby: http://research.google.com/archive/chubby.html
|
||||||
|
.. _raft implementations: http://raftconsensus.github.io/#implementations
|
||||||
|
.. _OpenStack projects: http://git.openstack.org/cgit/openstack/\
|
||||||
|
governance/tree/reference/projects.yaml
|
||||||
|
|
||||||
|
What has been created
|
||||||
|
---------------------
|
||||||
|
|
||||||
|
To show the current diversity let's dive slightly into what *some* of the
|
||||||
|
projects have created and/or used to resolve the problems mentioned above.
|
||||||
|
|
||||||
|
Cinder
|
||||||
|
******
|
||||||
|
|
||||||
|
**Problem:**
|
||||||
|
|
||||||
|
Avoid multiple entities from manipulating the same volume resource(s)
|
||||||
|
at the same time while still being scalable and highly available.
|
||||||
|
|
||||||
|
**Solution:**
|
||||||
|
|
||||||
|
Currently is limited to file locks and basic volume state transitions. Has
|
||||||
|
limited scalability and reliability of cinder under failure/load; has been
|
||||||
|
worked on for a while to attempt to create a solution that will fix some of
|
||||||
|
these fundamental issues.
|
||||||
|
|
||||||
|
**Notes:**
|
||||||
|
|
||||||
|
- For further reading/details these links can/may offer more insight.
|
||||||
|
|
||||||
|
- https://review.openstack.org/#/c/149894/
|
||||||
|
- https://review.openstack.org/#/c/202615/
|
||||||
|
- https://etherpad.openstack.org/p/mitaka-cinder-volmgr-locks
|
||||||
|
- https://etherpad.openstack.org/p/mitaka-cinder-cvol-aa
|
||||||
|
- (and more)
|
||||||
|
|
||||||
|
Ironic
|
||||||
|
******
|
||||||
|
|
||||||
|
**Problem:**
|
||||||
|
|
||||||
|
Avoid multiple conductors from manipulating the same bare-metal
|
||||||
|
instances and/or nodes at the same time while still being scalable and
|
||||||
|
highly available.
|
||||||
|
|
||||||
|
Other required/implemented functionality:
|
||||||
|
|
||||||
|
* Track what services are running, supporting what drivers, and rebalance
|
||||||
|
work when service state changes (service discovery and rebalancing).
|
||||||
|
* Sync state of temporary agents instead of polling or heartbeats.
|
||||||
|
|
||||||
|
**Solution:**
|
||||||
|
|
||||||
|
Partition resources onto a hash-ring to allow for ownership to be scaled
|
||||||
|
out among many conductors as needed. To avoid entities in that hash-ring
|
||||||
|
from manipulating the same resource/node that they both may co-own a database
|
||||||
|
lock is used to ensure single ownership. Actions taken on nodes are performed
|
||||||
|
after the lock (shared or exclusive) has been obtained (a `state machine`_
|
||||||
|
built using `automaton`_ also helps ensure only valid transitions
|
||||||
|
are performed).
|
||||||
|
|
||||||
|
**Notes:**
|
||||||
|
|
||||||
|
- Has logic for shared and exclusive locks and provisions for upgrading
|
||||||
|
a shared lock to an exclusive lock as needed (only one exclusive lock
|
||||||
|
on a given row/key may exist at the same time).
|
||||||
|
- Reclaim/take over lock mechanism via periodic heartbeats into the
|
||||||
|
database (reclaims is apparently a manual and clunky process).
|
||||||
|
|
||||||
|
**Code/doc references:**
|
||||||
|
|
||||||
|
- Some of the current issues listed at `pluggable-locking`_.
|
||||||
|
|
||||||
|
- `Etcd`_ proposed @ `179965`_ I believe this further validates the view
|
||||||
|
that we need a consensus on a uniform solution around DLM (vs continually
|
||||||
|
having projects implement whatever suites there fancy/flavor of the week).
|
||||||
|
|
||||||
|
- https://github.com/openstack/ironic/blob/master/ironic/conductor/task_manager.py#L20
|
||||||
|
- https://github.com/openstack/ironic/blob/master/ironic/conductor/task_manager.py#L222
|
||||||
|
|
||||||
|
.. _state machine: http://docs.openstack.org/developer/ironic/dev/states.html
|
||||||
|
.. _automaton: http://docs.openstack.org/developer/automaton/
|
||||||
|
.. _179965: https://review.openstack.org/#/c/179965
|
||||||
|
.. _Etcd: https://github.com/coreos/etcd
|
||||||
|
.. _pluggable-locking: https://blueprints.launchpad.net/ironic/+spec/pluggable-locking
|
||||||
|
|
||||||
|
Heat
|
||||||
|
****
|
||||||
|
|
||||||
|
**Problem:**
|
||||||
|
|
||||||
|
Multiple engines working on the same stack (or nested stack of). The
|
||||||
|
ongoing convergence rework may change this state of the world (so in the
|
||||||
|
future the problem space might be slightly different, but the concept
|
||||||
|
of requiring locks on resources will still exist).
|
||||||
|
|
||||||
|
**Solution:**
|
||||||
|
|
||||||
|
Lock a stack using a database lock and disallow other engines
|
||||||
|
from working on that same stack (or stack inside of it if nested),
|
||||||
|
using expiry/staleness allow other engines to claim potentially
|
||||||
|
lost lock after period of time.
|
||||||
|
|
||||||
|
**Notes:**
|
||||||
|
|
||||||
|
- Liveness of stack lock not easy to determine? For example is an engine
|
||||||
|
just taking a long time working on a stack, has the engine had a network
|
||||||
|
partition from the database but is still operational, or has the engine
|
||||||
|
really died?
|
||||||
|
|
||||||
|
- To resolve this a combination of an ``oslo.messaging`` ping used to
|
||||||
|
determine when a lock may be dead (or the owner of it is dead), if an
|
||||||
|
engine is non-responsive to pings/pongs after period of time (and its
|
||||||
|
associated database entry has expired) then stealing is allowed to occur.
|
||||||
|
|
||||||
|
- Lacks *simple* introspection capabilities? For example it is necessary
|
||||||
|
to examine the database or log files to determine who is trying to acquire
|
||||||
|
the lock, how long they have waited and so on.
|
||||||
|
|
||||||
|
- Lock releasing may fail (which is highly undesirable, *IMHO* it should
|
||||||
|
**never** be possible to fail releasing a lock); implementation does not
|
||||||
|
automatically release locks on application crash/disconnect/other but relies
|
||||||
|
on ping/pongs and database updating (each operation in this
|
||||||
|
complex 'stealing dance' may fail or be problematic, and therefore is not
|
||||||
|
especially simple).
|
||||||
|
|
||||||
|
**Code/doc references:**
|
||||||
|
|
||||||
|
- http://docs.openstack.org/developer/heat/_modules/heat/engine/stack_lock.html
|
||||||
|
- https://github.com/openstack/heat/blob/master/heat/engine/resource.py#L1307
|
||||||
|
|
||||||
|
Ceilometer and Sahara
|
||||||
|
*********************
|
||||||
|
|
||||||
|
**Problem:**
|
||||||
|
|
||||||
|
Distributing tasks across central agents.
|
||||||
|
|
||||||
|
**Solution:**
|
||||||
|
|
||||||
|
Token ring based on `tooz`_.
|
||||||
|
|
||||||
|
**Notes:**
|
||||||
|
|
||||||
|
Your project here
|
||||||
|
*****************
|
||||||
|
|
||||||
|
Solution analysis
|
||||||
|
=================
|
||||||
|
|
||||||
|
The proposed change would be to choose one of the following:
|
||||||
|
|
||||||
|
- Select a distributed lock manager (one that is opensource) and integrate
|
||||||
|
it *deeply* into openstack, work with the community that owns it to develop
|
||||||
|
and issues (or fix any found bugs) and use it for lock management
|
||||||
|
functionality and service discovery...
|
||||||
|
- Select a API (likely `tooz`_) that will be backed by capable
|
||||||
|
distributed lock manager(s) and integrate it *deeply* into openstack and
|
||||||
|
use it for lock management functionality and service discovery...
|
||||||
|
|
||||||
|
* `zookeeper`_ (`community respected
|
||||||
|
analysis <https://aphyr.com/posts/291-call-me-maybe-zookeeper>`__)
|
||||||
|
* `consul`_ (`community respected
|
||||||
|
analysis <https://aphyr.com/posts/316-call-me-maybe-etcd-and-consul>`__)
|
||||||
|
* `etc.d`_ (`community respected
|
||||||
|
analysis <https://aphyr.com/posts/316-call-me-maybe-etcd-and-consul>`__)
|
||||||
|
|
||||||
|
Zookeeper
|
||||||
|
---------
|
||||||
|
|
||||||
|
Summary:
|
||||||
|
|
||||||
|
Age: around 8 years
|
||||||
|
|
||||||
|
* Changelog was created in svn repository on aug 27, 2007.
|
||||||
|
|
||||||
|
License: Apache License 2.0
|
||||||
|
|
||||||
|
Approximate community size:
|
||||||
|
|
||||||
|
Features (overview):
|
||||||
|
|
||||||
|
- `Zab`_ based (paxos variant)
|
||||||
|
- Reliable filesystem like-storage (see `zk data model`_)
|
||||||
|
- Mature (and widely used) python client (via `kazoo`_)
|
||||||
|
- Mature shell/REPL interface (via `zkshell`_)
|
||||||
|
- Ephemeral nodes (filesystem entries that are tied to presence
|
||||||
|
of their creator)
|
||||||
|
- Self-cleaning trees (implemented in 3.5.0 via
|
||||||
|
https://issues.apache.org/jira/browse/ZOOKEEPER-2163)
|
||||||
|
- Dynamic reconfiguration (making upgrades/membership changes that
|
||||||
|
much easier to get right)
|
||||||
|
- https://zookeeper.apache.org/doc/trunk/zookeeperReconfig.html
|
||||||
|
|
||||||
|
Operability:
|
||||||
|
|
||||||
|
- Rolling restarts < 3.5.0 (to allow for upgrades to happen)
|
||||||
|
- Starting >= 3.5.0, 'rolling restarts' are no longer needed (see
|
||||||
|
mention of dynamic reconfiguration above)
|
||||||
|
- Java stack experience required
|
||||||
|
|
||||||
|
Language written in: java
|
||||||
|
|
||||||
|
.. _kazoo: http://kazoo.readthedocs.org/
|
||||||
|
.. _zkshell: https://pypi.python.org/pypi/zk_shell/
|
||||||
|
.. _zk data model: http://zookeeper.apache.org/doc/\
|
||||||
|
trunk/zookeeperProgrammers.html#ch_zkDataModel
|
||||||
|
.. _Zab: https://web.stanford.edu/class/cs347/reading/zab.pdf
|
||||||
|
|
||||||
|
Packaged: yes (at least on ubuntu and fedora)
|
||||||
|
|
||||||
|
* http://packages.ubuntu.com/trusty/java/zookeeperd
|
||||||
|
* https://apps.fedoraproject.org/packages/zookeeper
|
||||||
|
|
||||||
|
Consul
|
||||||
|
------
|
||||||
|
|
||||||
|
Summary:
|
||||||
|
|
||||||
|
Age: around 1.5 years
|
||||||
|
|
||||||
|
* Repository changelog denotes added in april 2014.
|
||||||
|
|
||||||
|
License: Mozilla Public License, version 2.0
|
||||||
|
|
||||||
|
Approximate community size:
|
||||||
|
|
||||||
|
Features (overview):
|
||||||
|
|
||||||
|
- Raft based
|
||||||
|
- DNS interface
|
||||||
|
- HTTP interface
|
||||||
|
- Reliable K/V storage
|
||||||
|
- Suited for multi-datacenter usage
|
||||||
|
- Python client (via `python-consul`_)
|
||||||
|
|
||||||
|
.. _python-consul: https://pypi.python.org/pypi/python-consul
|
||||||
|
.. _consul: https://www.consul.io/
|
||||||
|
|
||||||
|
Operability:
|
||||||
|
|
||||||
|
* Go stack experience required
|
||||||
|
|
||||||
|
Language written in: go
|
||||||
|
|
||||||
|
Packaged: somewhat (at least on ubuntu and fedora)
|
||||||
|
|
||||||
|
* Ppa at https://launchpad.net/~bcandrea/+archive/ubuntu/consul
|
||||||
|
* https://admin.fedoraproject.org/pkgdb/package/consul/ (?)
|
||||||
|
|
||||||
|
Etc.d
|
||||||
|
-----
|
||||||
|
|
||||||
|
Summary:
|
||||||
|
|
||||||
|
Age: Around 1.09 years old
|
||||||
|
|
||||||
|
License: Apache License 2.0
|
||||||
|
|
||||||
|
Approximate community size:
|
||||||
|
|
||||||
|
Features (overview):
|
||||||
|
|
||||||
|
Language written in: go
|
||||||
|
|
||||||
|
Operability:
|
||||||
|
|
||||||
|
* Go stack experience required
|
||||||
|
|
||||||
|
Packaged: ?
|
||||||
|
|
||||||
|
Proposed change
|
||||||
|
===============
|
||||||
|
|
||||||
|
Place all functionality behind `tooz`_ (as much as possible) and let the
|
||||||
|
operator choose which implementation to use. Do note that functionality that
|
||||||
|
is not possible in all backends (for example consul provides a `DNS`_ interface
|
||||||
|
that complements its HTTP REST interface) will not be able to be exposed
|
||||||
|
through a `tooz`_ API, so this may limit the developer using `tooz`_ to
|
||||||
|
implement some feature/s).
|
||||||
|
|
||||||
|
Compliance: further details about what each `tooz`_ driver must
|
||||||
|
conform to (as in regard to how it operates, what functionality it must support
|
||||||
|
and under what consistency, availability, and partition tolerance scheme
|
||||||
|
it must operate under) will be detailed at: `240645`_
|
||||||
|
|
||||||
|
It is expected as the result of `240645`_ that
|
||||||
|
certain existing `tooz`_ drivers will be deprecated and eventually removed
|
||||||
|
after a given number of cycles (due to there inherent inability to meet the
|
||||||
|
policy constraints created by that specification) so that the quality
|
||||||
|
and consistency of there operating policy can be guaranteed (this guarantee
|
||||||
|
reduces the divergence in implementations that makes plugins that much
|
||||||
|
harder to diagnosis, debug, and validate).
|
||||||
|
|
||||||
|
.. Note::
|
||||||
|
|
||||||
|
Do note that the `tooz`_ alternative which needs to be understood
|
||||||
|
is that `tooz`_ is a tiny layer around solutions mentioned above, which
|
||||||
|
is an admirable goal (I guess I can say this since I helped make that
|
||||||
|
library) but it does favor pluggability over picking one solution and
|
||||||
|
making it better. This is obviously a trade-off that must IMHO **not** be
|
||||||
|
ignored (since ``X`` solutions mean that it becomes that much harder to
|
||||||
|
diagnose and fix upstream issues because ``X - Y`` solutions may not have
|
||||||
|
the issue in the first place); TLDR: pluggability comes at a cost.
|
||||||
|
|
||||||
|
.. _DNS: http://www.consul.io/docs/agent/dns.html
|
||||||
|
.. _tooz: http://docs.openstack.org/developer/tooz/
|
||||||
|
.. _240645: https://review.openstack.org/#/c/240645/
|
||||||
|
|
||||||
|
Implementation
|
||||||
|
==============
|
||||||
|
|
||||||
|
Assignee(s)
|
||||||
|
-----------
|
||||||
|
|
||||||
|
- All the reviewers, code creators, PTL(s) of OpenStack?
|
||||||
|
|
||||||
|
Work Items
|
||||||
|
----------
|
||||||
|
|
||||||
|
Dependencies
|
||||||
|
============
|
||||||
|
|
||||||
|
History
|
||||||
|
=======
|
||||||
|
|
||||||
|
.. list-table:: Revisions
|
||||||
|
:header-rows: 1
|
||||||
|
|
||||||
|
* - Release Name
|
||||||
|
- Description
|
||||||
|
* - Mitaka
|
||||||
|
- Introduced
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
This work is licensed under a Creative Commons Attribution 3.0 Unported License.
|
||||||
|
http://creativecommons.org/licenses/by/3.0/legalcode
|
Loading…
Reference in New Issue