diff --git a/specs/chronicles-of-a-dlm.rst b/specs/chronicles-of-a-dlm.rst new file mode 100644 index 0000000..0c6f9b3 --- /dev/null +++ b/specs/chronicles-of-a-dlm.rst @@ -0,0 +1,403 @@ +========================================== + Chronicles of a distributed lock manager +========================================== + +No blueprint, this is intended as a reference/consensus document. + +The various OpenStack projects have an ongoing requirement to perform +some set of actions in an atomic manner performed by some distributed set of +applications on some set of distributed resources **without** having those +resources end up in some corrupted state due those actions being performed on +them without the traditional concept of `locking`_. + +A `DLM`_ is one such concept/solution that can help (but not entirely +solve) these types of common resource manipulation patterns in distributed +systems. This specification will be an attempt at defining the problem +space, understanding what each project *currently* has done in regards of +creating its own `DLM`_-like entity and how we can make the situation better +by coming to consensus on a common solution that we can benefit from to +make everyone's lives (developers, operators and users of OpenStack +projects) that much better. Such a consensus being built will also +influence the future functionality and capabilities of OpenStack at large +so we need to be **especially** careful, thoughtful, and explicit here. + +.. _DLM: https://en.wikipedia.org/wiki/Distributed_lock_manager +.. _locking: https://en.wikipedia.org/wiki/Lock_%28computer_science%29 + +Problem description +=================== + +Building distributed systems is **hard**. It is especially hard when the +distributed system (and the applications ``[X, Y, Z...]`` that compose the +parts of that system) manipulate mutable resources without the ability to do +so in a conflict-free, highly available, and +scalable manner (for example, application ``X`` on machine ``1`` resizes +volume ``A``, while application ``Y`` on machine ``2`` is writing files to +volume ``A``). Typically in local applications (running on a single +machine) these types of conflicts are avoided by using primitives provided +by the operating system (`pthreads`_ for example, or filesystem locks, or +other similar `CAS`_ like operations provided by the `processor instruction`_ +set). In distributed systems these types of solutions do **not** work, so +alternatives have to either be invented or provided by some +other service (for example one of the many academia has created, such +as `raft`_ and/or other `paxos`_ variants, or services created +from these papers/concepts such as `zookeeper`_ or `chubby`_ or one of the +many `raft implementations`_ or the redis `redlock`_ algorithm). Sadly in +OpenStack this has meant that there are now multiple implementations/inventions +of such concepts (most using some variation of database locking), using +different techniques to achieve the defined goal (conflict-free, highly +available, and scalable manipulation of resources). To make things worse +some projects still desire to have this concept and have not reached the +point where it is needed (or they have reached this point but have been +unable to achieve consensus around an implementation and/or +direction). Overall this diversity, while nice for inventors and people +that like to explore these concepts does **not** appear to be the best +solution we can provide to operators, developers inside the +community, deployers and other users of the now (and every expanding) diverse +set of `OpenStack projects`_. + +.. _redlock: http://redis.io/topics/distlock +.. _pthreads: http://man7.org/linux/man-pages/man7/pthreads.7.html +.. _CAS: https://en.wikipedia.org/wiki/Compare-and-swap +.. _processor instruction: http://www.felixcloutier.com/x86/CMPXCHG.html +.. _paxos: https://en.wikipedia.org/wiki/Paxos_%28computer_science%29 +.. _raft: http://raftconsensus.github.io/ +.. _zookeeper: https://en.wikipedia.org/wiki/Apache_ZooKeeper +.. _chubby: http://research.google.com/archive/chubby.html +.. _raft implementations: http://raftconsensus.github.io/#implementations +.. _OpenStack projects: http://git.openstack.org/cgit/openstack/\ + governance/tree/reference/projects.yaml + +What has been created +--------------------- + +To show the current diversity let's dive slightly into what *some* of the +projects have created and/or used to resolve the problems mentioned above. + +Cinder +****** + +**Problem:** + +Avoid multiple entities from manipulating the same volume resource(s) +at the same time while still being scalable and highly available. + +**Solution:** + +Currently is limited to file locks and basic volume state transitions. Has +limited scalability and reliability of cinder under failure/load; has been +worked on for a while to attempt to create a solution that will fix some of +these fundamental issues. + +**Notes:** + +- For further reading/details these links can/may offer more insight. + + - https://review.openstack.org/#/c/149894/ + - https://review.openstack.org/#/c/202615/ + - https://etherpad.openstack.org/p/mitaka-cinder-volmgr-locks + - https://etherpad.openstack.org/p/mitaka-cinder-cvol-aa + - (and more) + +Ironic +****** + +**Problem:** + +Avoid multiple conductors from manipulating the same bare-metal +instances and/or nodes at the same time while still being scalable and +highly available. + +Other required/implemented functionality: + +* Track what services are running, supporting what drivers, and rebalance + work when service state changes (service discovery and rebalancing). +* Sync state of temporary agents instead of polling or heartbeats. + +**Solution:** + +Partition resources onto a hash-ring to allow for ownership to be scaled +out among many conductors as needed. To avoid entities in that hash-ring +from manipulating the same resource/node that they both may co-own a database +lock is used to ensure single ownership. Actions taken on nodes are performed +after the lock (shared or exclusive) has been obtained (a `state machine`_ +built using `automaton`_ also helps ensure only valid transitions +are performed). + +**Notes:** + +- Has logic for shared and exclusive locks and provisions for upgrading + a shared lock to an exclusive lock as needed (only one exclusive lock + on a given row/key may exist at the same time). +- Reclaim/take over lock mechanism via periodic heartbeats into the + database (reclaims is apparently a manual and clunky process). + +**Code/doc references:** + +- Some of the current issues listed at `pluggable-locking`_. + + - `Etcd`_ proposed @ `179965`_ I believe this further validates the view + that we need a consensus on a uniform solution around DLM (vs continually + having projects implement whatever suites there fancy/flavor of the week). + +- https://github.com/openstack/ironic/blob/master/ironic/conductor/task_manager.py#L20 +- https://github.com/openstack/ironic/blob/master/ironic/conductor/task_manager.py#L222 + +.. _state machine: http://docs.openstack.org/developer/ironic/dev/states.html +.. _automaton: http://docs.openstack.org/developer/automaton/ +.. _179965: https://review.openstack.org/#/c/179965 +.. _Etcd: https://github.com/coreos/etcd +.. _pluggable-locking: https://blueprints.launchpad.net/ironic/+spec/pluggable-locking + +Heat +**** + +**Problem:** + +Multiple engines working on the same stack (or nested stack of). The +ongoing convergence rework may change this state of the world (so in the +future the problem space might be slightly different, but the concept +of requiring locks on resources will still exist). + +**Solution:** + +Lock a stack using a database lock and disallow other engines +from working on that same stack (or stack inside of it if nested), +using expiry/staleness allow other engines to claim potentially +lost lock after period of time. + +**Notes:** + +- Liveness of stack lock not easy to determine? For example is an engine + just taking a long time working on a stack, has the engine had a network + partition from the database but is still operational, or has the engine + really died? + + - To resolve this a combination of an ``oslo.messaging`` ping used to + determine when a lock may be dead (or the owner of it is dead), if an + engine is non-responsive to pings/pongs after period of time (and its + associated database entry has expired) then stealing is allowed to occur. + +- Lacks *simple* introspection capabilities? For example it is necessary + to examine the database or log files to determine who is trying to acquire + the lock, how long they have waited and so on. + +- Lock releasing may fail (which is highly undesirable, *IMHO* it should + **never** be possible to fail releasing a lock); implementation does not + automatically release locks on application crash/disconnect/other but relies + on ping/pongs and database updating (each operation in this + complex 'stealing dance' may fail or be problematic, and therefore is not + especially simple). + +**Code/doc references:** + +- http://docs.openstack.org/developer/heat/_modules/heat/engine/stack_lock.html +- https://github.com/openstack/heat/blob/master/heat/engine/resource.py#L1307 + +Ceilometer and Sahara +********************* + +**Problem:** + +Distributing tasks across central agents. + +**Solution:** + +Token ring based on `tooz`_. + +**Notes:** + +Your project here +***************** + +Solution analysis +================= + +The proposed change would be to choose one of the following: + +- Select a distributed lock manager (one that is opensource) and integrate + it *deeply* into openstack, work with the community that owns it to develop + and issues (or fix any found bugs) and use it for lock management + functionality and service discovery... +- Select a API (likely `tooz`_) that will be backed by capable + distributed lock manager(s) and integrate it *deeply* into openstack and + use it for lock management functionality and service discovery... + +* `zookeeper`_ (`community respected + analysis `__) +* `consul`_ (`community respected + analysis `__) +* `etc.d`_ (`community respected + analysis `__) + +Zookeeper +--------- + +Summary: + +Age: around 8 years + +* Changelog was created in svn repository on aug 27, 2007. + +License: Apache License 2.0 + +Approximate community size: + +Features (overview): + +- `Zab`_ based (paxos variant) +- Reliable filesystem like-storage (see `zk data model`_) +- Mature (and widely used) python client (via `kazoo`_) +- Mature shell/REPL interface (via `zkshell`_) +- Ephemeral nodes (filesystem entries that are tied to presence + of their creator) +- Self-cleaning trees (implemented in 3.5.0 via + https://issues.apache.org/jira/browse/ZOOKEEPER-2163) +- Dynamic reconfiguration (making upgrades/membership changes that + much easier to get right) + - https://zookeeper.apache.org/doc/trunk/zookeeperReconfig.html + +Operability: + +- Rolling restarts < 3.5.0 (to allow for upgrades to happen) +- Starting >= 3.5.0, 'rolling restarts' are no longer needed (see + mention of dynamic reconfiguration above) +- Java stack experience required + +Language written in: java + +.. _kazoo: http://kazoo.readthedocs.org/ +.. _zkshell: https://pypi.python.org/pypi/zk_shell/ +.. _zk data model: http://zookeeper.apache.org/doc/\ + trunk/zookeeperProgrammers.html#ch_zkDataModel +.. _Zab: https://web.stanford.edu/class/cs347/reading/zab.pdf + +Packaged: yes (at least on ubuntu and fedora) + +* http://packages.ubuntu.com/trusty/java/zookeeperd +* https://apps.fedoraproject.org/packages/zookeeper + +Consul +------ + +Summary: + +Age: around 1.5 years + +* Repository changelog denotes added in april 2014. + +License: Mozilla Public License, version 2.0 + +Approximate community size: + +Features (overview): + +- Raft based +- DNS interface +- HTTP interface +- Reliable K/V storage +- Suited for multi-datacenter usage +- Python client (via `python-consul`_) + +.. _python-consul: https://pypi.python.org/pypi/python-consul +.. _consul: https://www.consul.io/ + +Operability: + +* Go stack experience required + +Language written in: go + +Packaged: somewhat (at least on ubuntu and fedora) + +* Ppa at https://launchpad.net/~bcandrea/+archive/ubuntu/consul +* https://admin.fedoraproject.org/pkgdb/package/consul/ (?) + +Etc.d +----- + +Summary: + +Age: Around 1.09 years old + +License: Apache License 2.0 + +Approximate community size: + +Features (overview): + +Language written in: go + +Operability: + +* Go stack experience required + +Packaged: ? + +Proposed change +=============== + +Place all functionality behind `tooz`_ (as much as possible) and let the +operator choose which implementation to use. Do note that functionality that +is not possible in all backends (for example consul provides a `DNS`_ interface +that complements its HTTP REST interface) will not be able to be exposed +through a `tooz`_ API, so this may limit the developer using `tooz`_ to +implement some feature/s). + +Compliance: further details about what each `tooz`_ driver must +conform to (as in regard to how it operates, what functionality it must support +and under what consistency, availability, and partition tolerance scheme +it must operate under) will be detailed at: `240645`_ + +It is expected as the result of `240645`_ that +certain existing `tooz`_ drivers will be deprecated and eventually removed +after a given number of cycles (due to there inherent inability to meet the +policy constraints created by that specification) so that the quality +and consistency of there operating policy can be guaranteed (this guarantee +reduces the divergence in implementations that makes plugins that much +harder to diagnosis, debug, and validate). + +.. Note:: + + Do note that the `tooz`_ alternative which needs to be understood + is that `tooz`_ is a tiny layer around solutions mentioned above, which + is an admirable goal (I guess I can say this since I helped make that + library) but it does favor pluggability over picking one solution and + making it better. This is obviously a trade-off that must IMHO **not** be + ignored (since ``X`` solutions mean that it becomes that much harder to + diagnose and fix upstream issues because ``X - Y`` solutions may not have + the issue in the first place); TLDR: pluggability comes at a cost. + +.. _DNS: http://www.consul.io/docs/agent/dns.html +.. _tooz: http://docs.openstack.org/developer/tooz/ +.. _240645: https://review.openstack.org/#/c/240645/ + +Implementation +============== + +Assignee(s) +----------- + +- All the reviewers, code creators, PTL(s) of OpenStack? + +Work Items +---------- + +Dependencies +============ + +History +======= + +.. list-table:: Revisions + :header-rows: 1 + + * - Release Name + - Description + * - Mitaka + - Introduced + +.. note:: + + This work is licensed under a Creative Commons Attribution 3.0 Unported License. + http://creativecommons.org/licenses/by/3.0/legalcode