Chronicles of a DLM

Distributed locks, the concept, the problem space, what has been done, what could be done, what will be done, what must be done, the journey begins now (weekdays at 7pm EST).

Updating due to summit session result, some more small work and/or tweaks still needed.

Change-Id: Ibce47659c1108b94b4d538c63da69ce0371aed04

Commit 2bd1df191c (parent a492cacfce)

==========================================
Chronicles of a distributed lock manager
==========================================

No blueprint, this is intended as a reference/consensus document.

The various OpenStack projects have an ongoing requirement to have some
distributed set of applications perform some set of actions atomically on
some set of distributed resources **without** having those resources end up
in a corrupted state because those actions were performed on them without
the traditional concept of `locking`_.

A `DLM`_ is one such concept/solution that can help with (but not entirely
solve) these types of common resource manipulation patterns in distributed
systems. This specification is an attempt at defining the problem
space, understanding what each project has *currently* done with regard to
creating its own `DLM`_-like entity, and working out how we can make the
situation better by coming to consensus on a common solution that makes
everyone's lives (developers, operators and users of OpenStack
projects) that much better. Such a consensus will also
influence the future functionality and capabilities of OpenStack at large,
so we need to be **especially** careful, thoughtful, and explicit here.

.. _DLM: https://en.wikipedia.org/wiki/Distributed_lock_manager
.. _locking: https://en.wikipedia.org/wiki/Lock_%28computer_science%29

Problem description
===================

Building distributed systems is **hard**. It is especially hard when the
distributed system (and the applications ``[X, Y, Z...]`` that compose the
parts of that system) manipulates mutable resources without the ability to do
so in a conflict-free, highly available, and
scalable manner (for example, application ``X`` on machine ``1`` resizes
volume ``A``, while application ``Y`` on machine ``2`` is writing files to
volume ``A``). Typically in local applications (running on a single
machine) these types of conflicts are avoided by using primitives provided
by the operating system (`pthreads`_ for example, or filesystem locks, or
other similar `CAS`_ like operations provided by the `processor instruction`_
set). In distributed systems these types of solutions do **not** work, so
alternatives have to either be invented or provided by some
other service (for example one of the many that academia has created, such
as `raft`_ and/or other `paxos`_ variants, or services created
from these papers/concepts such as `zookeeper`_ or `chubby`_ or one of the
many `raft implementations`_ or the redis `redlock`_ algorithm). Sadly in
OpenStack this has meant that there are now multiple implementations/inventions
of such concepts (most using some variation of database locking), using
different techniques to achieve the defined goal (conflict-free, highly
available, and scalable manipulation of resources). To make things worse
some projects still desire to have this concept and have not reached the
point where it is needed (or they have reached this point but have been
unable to achieve consensus around an implementation and/or
direction). Overall this diversity, while nice for inventors and people
that like to explore these concepts, does **not** appear to be the best
solution we can provide to operators, developers inside the
community, deployers and other users of the now diverse (and ever expanding)
set of `OpenStack projects`_.

.. _redlock: http://redis.io/topics/distlock
.. _pthreads: http://man7.org/linux/man-pages/man7/pthreads.7.html
.. _CAS: https://en.wikipedia.org/wiki/Compare-and-swap
.. _processor instruction: http://www.felixcloutier.com/x86/CMPXCHG.html
.. _paxos: https://en.wikipedia.org/wiki/Paxos_%28computer_science%29
.. _raft: http://raftconsensus.github.io/
.. _zookeeper: https://en.wikipedia.org/wiki/Apache_ZooKeeper
.. _chubby: http://research.google.com/archive/chubby.html
.. _raft implementations: http://raftconsensus.github.io/#implementations
.. _OpenStack projects: http://git.openstack.org/cgit/openstack/\
                        governance/tree/reference/projects.yaml
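
As a minimal sketch of the single-machine case described above, the following
uses a process-local lock to make a read-modify-write on shared state safe
between threads. The names ``run_workers`` and ``state`` are illustrative
assumptions and are not taken from any OpenStack project.

.. code-block:: python

   import threading


   def run_workers(n_threads=4, increments=10000):
       """Increment a shared counter from several threads under one lock."""
       lock = threading.Lock()
       state = {"counter": 0}

       def worker():
           for _ in range(increments):
               # The lock makes this read-modify-write atomic with respect
               # to the other threads of this one process.
               with lock:
                   state["counter"] += 1

       threads = [threading.Thread(target=worker) for _ in range(n_threads)]
       for t in threads:
           t.start()
       for t in threads:
           t.join()
       return state["counter"]

A lock like this lives in the memory of one process on one machine, so it
offers no protection once application ``X`` and application ``Y`` run on
different machines; that gap is what the services listed above try to fill.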

What has been created
---------------------

To show the current diversity let's dive slightly into what *some* of the
projects have created and/or used to resolve the problems mentioned above.

Cinder
******

**Problem:**

Prevent multiple entities from manipulating the same volume resource(s)
at the same time while still being scalable and highly available.

**Solution:**

Currently limited to file locks and basic volume state transitions. This
limits the scalability and reliability of Cinder under failure/load; work
has been ongoing for a while to create a solution that will fix some of
these fundamental issues.

**Notes:**

- For further reading/details these links may offer more insight.

  - https://review.openstack.org/#/c/149894/
  - https://review.openstack.org/#/c/202615/
  - https://etherpad.openstack.org/p/mitaka-cinder-volmgr-locks
  - https://etherpad.openstack.org/p/mitaka-cinder-cvol-aa
  - (and more)

Ironic
******

**Problem:**

Prevent multiple conductors from manipulating the same bare-metal
instances and/or nodes at the same time while still being scalable and
highly available.

Other required/implemented functionality:

* Track what services are running, supporting what drivers, and rebalance
  work when service state changes (service discovery and rebalancing).
* Sync state of temporary agents instead of polling or heartbeats.

**Solution:**

Partition resources onto a hash-ring to allow for ownership to be scaled
out among many conductors as needed. To prevent entities in that hash-ring
from manipulating the same resource/node that they may co-own, a database
lock is used to ensure single ownership. Actions taken on nodes are performed
after the lock (shared or exclusive) has been obtained (a `state machine`_
built using `automaton`_ also helps ensure only valid transitions
are performed).
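
The hash-ring partitioning idea can be sketched roughly as follows. This is a
generic consistent-hashing illustration, not Ironic's actual implementation;
the ``HashRing`` class, its ``replicas`` parameter and the conductor/node
names are assumptions made for the example.

.. code-block:: python

   import bisect
   import hashlib


   class HashRing:
       """Map resource ids onto members using consistent hashing."""

       def __init__(self, members, replicas=32):
           # Each member is placed on the ring several times (virtual
           # points) so that ownership is spread more evenly.
           ring = []
           for member in members:
               for i in range(replicas):
                   point = self._hash("%s-%d" % (member, i))
                   ring.append((point, member))
           ring.sort()
           self._points = [point for point, _ in ring]
           self._members = [member for _, member in ring]

       @staticmethod
       def _hash(value):
           digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
           return int(digest, 16)

       def owner(self, resource_id):
           # The owner is the first point clockwise from the resource's
           # hash, wrapping around at the end of the ring.
           idx = bisect.bisect(self._points, self._hash(resource_id))
           return self._members[idx % len(self._points)]

With a ring such as ``HashRing(["conductor-1", "conductor-2"])``, every
process that builds the ring from the same member list computes the same
owner for a given node. The ring only decides *who should* act on a node; it
does not stop two conductors with differing views of membership from acting
at once, which is why a lock is still needed on top of it.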

**Notes:**

- Has logic for shared and exclusive locks and provisions for upgrading
  a shared lock to an exclusive lock as needed (only one exclusive lock
  on a given row/key may exist at the same time).
- Reclaim/take over lock mechanism via periodic heartbeats into the
  database (reclaiming is apparently a manual and clunky process).

**Code/doc references:**

- Some of the current issues are listed at `pluggable-locking`_.

- `Etcd`_ was proposed at `179965`_; I believe this further validates the view
  that we need a consensus on a uniform solution around DLM (vs continually
  having projects implement whatever suits their fancy/flavor of the week).

- https://github.com/openstack/ironic/blob/master/ironic/conductor/task_manager.py#L20
- https://github.com/openstack/ironic/blob/master/ironic/conductor/task_manager.py#L222

.. _state machine: http://docs.openstack.org/developer/ironic/dev/states.html
.. _automaton: http://docs.openstack.org/developer/automaton/
.. _179965: https://review.openstack.org/#/c/179965
.. _Etcd: https://github.com/coreos/etcd
.. _pluggable-locking: https://blueprints.launchpad.net/ironic/+spec/pluggable-locking

Heat
****

**Problem:**

Multiple engines working on the same stack (or on a stack nested inside
it). The ongoing convergence rework may change this state of the world (so
in the future the problem space might be slightly different, but the concept
of requiring locks on resources will still exist).

**Solution:**

Lock a stack using a database lock and disallow other engines
from working on that same stack (or a stack inside of it if nested).
Expiry/staleness is used to allow other engines to claim a potentially
lost lock after a period of time.
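
A database lock with expiry can be sketched as below. This is a simplified
illustration using ``sqlite3`` from the standard library and is not Heat's
actual code; the ``stack_lock`` table, its columns, and the ``setup``,
``try_acquire`` and ``release`` helpers are all hypothetical names.

.. code-block:: python

   import sqlite3
   import time


   def setup(conn):
       conn.execute(
           "CREATE TABLE IF NOT EXISTS stack_lock ("
           "stack_id TEXT PRIMARY KEY, "
           "engine_id TEXT NOT NULL, "
           "expires_at REAL NOT NULL)")


   def try_acquire(conn, stack_id, engine_id, ttl=30.0, now=None):
       """Take the lock if it is free or its holder's entry has expired."""
       now = time.time() if now is None else now
       with conn:  # one transaction: commit on exit, roll back on error
           # Steal an existing entry only when it is stale.
           cur = conn.execute(
               "UPDATE stack_lock SET engine_id = ?, expires_at = ? "
               "WHERE stack_id = ? AND expires_at <= ?",
               (engine_id, now + ttl, stack_id, now))
           if cur.rowcount == 1:
               return True
           try:
               # No stale entry: succeed only if no entry exists at all.
               conn.execute(
                   "INSERT INTO stack_lock VALUES (?, ?, ?)",
                   (stack_id, engine_id, now + ttl))
               return True
           except sqlite3.IntegrityError:
               return False  # held by someone else and not yet expired


   def release(conn, stack_id, engine_id):
       """Delete the entry, but only if this engine still owns it."""
       with conn:
           cur = conn.execute(
               "DELETE FROM stack_lock WHERE stack_id = ? AND engine_id = ?",
               (stack_id, engine_id))
       return cur.rowcount == 1

Note that expiry alone cannot tell a slow engine from a dead one: an engine
whose entry has expired may still be working on the stack when another engine
takes the lock over, and this sketch has no way for a holder to refresh its
own entry.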

**Notes:**

- Liveness of a stack lock is not easy to determine. For example, is an
  engine just taking a long time working on a stack, has the engine had a
  network partition from the database but is still operational, or has the
  engine really died?

- To resolve this, an ``oslo.messaging`` ping is used to
  determine when a lock may be dead (or the owner of it is dead); if an
  engine is non-responsive to pings/pongs after a period of time (and its
  associated database entry has expired) then stealing is allowed to occur.

- Lacks *simple* introspection capabilities. For example, it is necessary
  to examine the database or log files to determine who is trying to acquire
  the lock, how long they have waited and so on.

- Lock releasing may fail (which is highly undesirable, *IMHO* it should
  **never** be possible to fail releasing a lock); the implementation does not
  automatically release locks on application crash/disconnect/other but relies
  on ping/pongs and database updating (each operation in this
  complex 'stealing dance' may fail or be problematic, and therefore is not
  especially simple).

**Code/doc references:**

- http://docs.openstack.org/developer/heat/_modules/heat/engine/stack_lock.html
- https://github.com/openstack/heat/blob/master/heat/engine/resource.py#L1307

Ceilometer and Sahara
*********************

**Problem:**

Distributing tasks across central agents.

**Solution:**

Token ring based on `tooz`_.

**Notes:**

Your project here
*****************

Solution analysis
=================

The proposed change would be to choose one of the following:

- Select a distributed lock manager (one that is open source) and integrate
  it *deeply* into OpenStack, work with the community that owns it to resolve
  any issues (or fix any found bugs) and use it for lock management
  functionality and service discovery...
- Select an API (likely `tooz`_) that will be backed by capable
  distributed lock manager(s) and integrate it *deeply* into OpenStack and
  use it for lock management functionality and service discovery...

* `zookeeper`_ (`community respected
  analysis <https://aphyr.com/posts/291-call-me-maybe-zookeeper>`__)
* `consul`_ (`community respected
  analysis <https://aphyr.com/posts/316-call-me-maybe-etcd-and-consul>`__)
* `etcd`_ (`community respected
  analysis <https://aphyr.com/posts/316-call-me-maybe-etcd-and-consul>`__)

Zookeeper
---------

Summary:

Age: around 8 years

* Changelog was created in svn repository on Aug 27, 2007.

License: Apache License 2.0

Approximate community size:

Features (overview):

- `Zab`_ based (paxos variant)
- Reliable filesystem-like storage (see `zk data model`_)
- Mature (and widely used) Python client (via `kazoo`_)
- Mature shell/REPL interface (via `zkshell`_)
- Ephemeral nodes (filesystem entries that are tied to the presence
  of their creator)
- Self-cleaning trees (implemented in 3.5.0 via
  https://issues.apache.org/jira/browse/ZOOKEEPER-2163)
- Dynamic reconfiguration (making upgrades/membership changes that
  much easier to get right)

  - https://zookeeper.apache.org/doc/trunk/zookeeperReconfig.html

Operability:

- Rolling restarts are needed for versions < 3.5.0 (to allow for upgrades
  to happen)
- Starting with 3.5.0, 'rolling restarts' are no longer needed (see
  mention of dynamic reconfiguration above)
- Java stack experience required

Language written in: Java

.. _kazoo: http://kazoo.readthedocs.org/
.. _zkshell: https://pypi.python.org/pypi/zk_shell/
.. _zk data model: http://zookeeper.apache.org/doc/\
                   trunk/zookeeperProgrammers.html#ch_zkDataModel
.. _Zab: https://web.stanford.edu/class/cs347/reading/zab.pdf

Packaged: yes (at least on Ubuntu and Fedora)

* http://packages.ubuntu.com/trusty/java/zookeeperd
* https://apps.fedoraproject.org/packages/zookeeper

Consul
------

Summary:

Age: around 1.5 years

* Repository changelog denotes it was added in April 2014.

License: Mozilla Public License, version 2.0

Approximate community size:

Features (overview):

- Raft based
- DNS interface
- HTTP interface
- Reliable K/V storage
- Suited for multi-datacenter usage
- Python client (via `python-consul`_)

.. _python-consul: https://pypi.python.org/pypi/python-consul
.. _consul: https://www.consul.io/

Operability:

* Go stack experience required

Language written in: Go

Packaged: somewhat (at least on Ubuntu and Fedora)

* PPA at https://launchpad.net/~bcandrea/+archive/ubuntu/consul
* https://admin.fedoraproject.org/pkgdb/package/consul/ (?)

Etc.d
-----

Summary:

Age: Around 1.09 years old

License: Apache License 2.0

Approximate community size:

Features (overview):

Language written in: Go

Operability:

* Go stack experience required

Packaged: ?

Proposed change
===============

Place all functionality behind `tooz`_ (as much as possible) and let the
operator choose which implementation to use. Do note that functionality that
is not possible in all backends (for example consul provides a `DNS`_ interface
that complements its HTTP REST interface) will not be able to be exposed
through a `tooz`_ API, so this may limit developers using `tooz`_ when
implementing some features.
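
The trade-off in the paragraph above can be sketched with a small pluggable
lock interface. This is **not** the `tooz`_ API; ``LockDriver``,
``InMemoryDriver``, ``DRIVERS`` and ``get_driver`` are hypothetical names
used only to illustrate how callers code against a common interface while
the operator picks the backend through a URL.

.. code-block:: python

   import abc
   import threading


   class LockDriver(abc.ABC):
       """Hypothetical common interface every backend must implement."""

       @abc.abstractmethod
       def acquire(self, name, blocking=True):
           """Return True if the named lock was obtained."""

       @abc.abstractmethod
       def release(self, name):
           """Release a previously acquired named lock."""


   class InMemoryDriver(LockDriver):
       """Single-process stand-in for a real distributed backend."""

       def __init__(self):
           self._guard = threading.Lock()
           self._locks = {}

       def _get(self, name):
           with self._guard:
               return self._locks.setdefault(name, threading.Lock())

       def acquire(self, name, blocking=True):
           return self._get(name).acquire(blocking)

       def release(self, name):
           self._get(name).release()


   # Backends are registered by URL scheme (hypothetical registry).
   DRIVERS = {"memory": InMemoryDriver}


   def get_driver(backend_url):
       scheme = backend_url.split("://", 1)[0]
       try:
           return DRIVERS[scheme]()
       except KeyError:
           raise ValueError("no lock driver for %r" % scheme) from None

Only what ``LockDriver`` declares is available to callers, so anything a
single backend offers beyond that (such as a DNS interface) stays out of
reach unless every driver can provide it.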

Compliance: further details about what each `tooz`_ driver must
conform to (with regard to how it operates, what functionality it must
support, and what consistency, availability, and partition tolerance scheme
it must operate under) will be detailed at `240645`_.

It is expected as the result of `240645`_ that
certain existing `tooz`_ drivers will be deprecated and eventually removed
after a given number of cycles (due to their inherent inability to meet the
policy constraints created by that specification) so that the quality
and consistency of their operating policy can be guaranteed (this guarantee
reduces the divergence in implementations that makes plugins that much
harder to diagnose, debug, and validate).

.. Note::

  One thing that needs to be understood about the `tooz`_ alternative
  is that `tooz`_ is a tiny layer around the solutions mentioned above, which
  is an admirable goal (I guess I can say this since I helped make that
  library) but it does favor pluggability over picking one solution and
  making it better. This is obviously a trade-off that must IMHO **not** be
  ignored (since ``X`` solutions mean that it becomes that much harder to
  diagnose and fix upstream issues because ``X - Y`` solutions may not have
  the issue in the first place); TLDR: pluggability comes at a cost.

.. _DNS: http://www.consul.io/docs/agent/dns.html
.. _tooz: http://docs.openstack.org/developer/tooz/
.. _240645: https://review.openstack.org/#/c/240645/

Implementation
==============

Assignee(s)
-----------

- All the reviewers, code creators, PTL(s) of OpenStack?

Work Items
----------

Dependencies
============

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Mitaka
     - Introduced

.. note::

  This work is licensed under a Creative Commons Attribution 3.0 Unported License.
  http://creativecommons.org/licenses/by/3.0/legalcode