diff --git a/doc/source/conf.py b/doc/source/conf.py index 260bdad..f27b617 100755 --- a/doc/source/conf.py +++ b/doc/source/conf.py @@ -23,6 +23,7 @@ sys.path.insert(0, os.path.abspath('../..')) # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom ones. extensions = [ 'sphinx.ext.autodoc', + 'sphinx.ext.graphviz', 'oslosphinx', 'yasfb', ] diff --git a/specs/ocata/mechanism-to-prevent-race-conditions.rst b/specs/ocata/mechanism-to-prevent-race-conditions.rst new file mode 100644 index 0000000..accd508 --- /dev/null +++ b/specs/ocata/mechanism-to-prevent-race-conditions.rst @@ -0,0 +1,290 @@ +.. + This work is licensed under a Creative Commons Attribution 3.0 Unported + License. + + http://creativecommons.org/licenses/by/3.0/legalcode + +==================================== +Mechanism to Prevent Race Conditions +==================================== + +https://blueprints.launchpad.net/manila/+spec/eliminate-race-conditions + +This proposal is to develop a general solution for preventing race +conditions which will work across services and in deployments where +there are multiple copies of the same services (commonly known as +Active/Active HA deployments). + +The focus is on keeping all state in the database, and protecting changes to +database state using briefly-held locks. Also concurrent operations which +are mutually exclusive should fail as early as possible with a helpful error +code to simplify the retry logic of upper layers. + +Problem description +=================== + +Certain operations in Manila should not be allowed to proceed in parallel, +because the result of one operation would prevent the other operation from +completing successfully. + +For example, taking a snapshot of a share cannot happen simultaneously while +deleting that share. Either the snapshot must occur first, which prevents the +delete -- or the delete must occur first, which prevents the snapshot. +Unfortunately, not enough state is stored in the database to prevent these +operations from racing with each other, so in practice two API calls can both +proceed through the API service to the share manager, where eventually an +error will occur and one or both operations will fail mysteriously. + +There are multiple scenarios like the above where undefined behavior results. +This specification does not attempt to enumerate all of them because the goal +is to describe a mechanism for fixing these kinds of issues rather than +explicitly fixing all such issues. Generally speaking, race conditions should +be treated as bugs, but up until now Manila has lacked the tools to fix these +bugs reliably. + +Use cases +========= + +Specific cases: + +* Two snapshot operations should not be able to occur at the same time. One + must complete before the second can begin. This ensures that snapshots occur + in a known order, and prevents the useless situation of having 2 identical + snapshots. + +* Taking a snapshot of a share should prevent a delete of that share. + +* Valid changes to access rules should always be accepted regardless of the + state of the existing access rules. Although rules are applied to the backend + asynchronously, it's valid to add multiple rules faster than the system can + apply them and expect Manila to catch up. + +General cases: + +* These guarantees must be enforceable using a database running in a clustered + configuration such as Galera. This prevents obvious solutions such as relying + on DB row-level locking. + +* These guarantees must be enforceable while running multiple copies of Manila + services, including multiple API, scheduler, and share manager services. + This prevents obvious solutions like in-process locks. + +* These guarantees must be enforceable while running in a distributed + configuration where cooperating services are on different nodes (physical, + VMs, or containers). This prevents our existing approach of using file + locks. Even though network-based file locking solutions exist, they represent + a single point of failure and are unacceptable in properly distributed + environments. + +* The Manila services should be able to automatically and gracefully recover + from crashes and other unplanned downtime. This means that implicit state + should be avoided and because long-held locks are implicit state they should + be avoided in favor of short-held locks with explicit state. + +Proposed change +=============== + +More transitional states will be added so that operations which can conflict +with other operations can be explicitly detected by looking at the share +state. This spec only proposes one specific new state to address the races +involving snapshots, but more generally provides a framework for resolving +similar races as they are discovered. + +Transitions between states will always be done while holding a distributed +lock -- a lock implemented by a distributed lock manager (DLM). Use of +distributed locks ensures that all services see the same locking state even if +services run on different nodes, and even across transient failures such as +node failures and network partitions. The lock will be held only for the +duration of the database test-and-set operation to minimize lock contention. + +No locks will be held during calls from the share manager to the share driver. +Mutual exclusion between driver calls will be achieved with state checks. + +No locks will be held during RPC calls or casts. + +Alternatives +------------ + +The approach used by Cinder which relies on elaborate SQL calls to +compare-and-swap fields was considered but rejected for the following reasons: + +* The code in Cinder can't be shared with Manila because it relies on OVO (Oslo + Versioned Objects) + +* Not enough people understand how it works so it's likely to be hard to + maintain. + +* Cinder's compare-and-swap approach limits the kind of state changes you can + make because updating multiple tables atomically is impossible. Locks + don't suffer from this restriction. + +Data model impact +----------------- + +New states will be added: + +* Snapshotting + +* States for access rules covered in `Access rules spec`_ + + +.. graphviz:: + + digraph share_states { + label="Share States" + // Transitional States + creating[shape=hexagon]; + manage_starting[shape=hexagon]; + deleting[shape=hexagon]; + snapshotting[shape=hexagon,color=gold4, fontcolor=gold4]; + migrating[shape=hexagon]; + shrinking[shape=hexagon]; + extending[shape=hexagon]; + unmanage_starting[shape=hexagon]; + replication_change[shape=hexagon]; + // Error states + error[color=red4, fontcolor=red4]; + shrinking_error[color=red4, fontcolor=red4]; + shrinking_possible_data_loss_error[color=red4, fontcolor=red4]; + extending_error[color=red4, fontcolor=red4]; + unmanage_error[color=red4, fontcolor=red4]; + manage_error[color=red4, fontcolor=red4]; + error_deleting[color=red4, fontcolor=red4]; + // Other states + new[color=blue, fontcolor=blue]; + available[color=darkgreen, fontcolor=darkgreen]; + deleted[shape=box, color=navy, fontcolor=navy]; + unmanaged[shape=box, color=navy, fontcolor=navy]; + // User requested transitions + new -> creating[label="create"]; + new -> manage_starting[label="manage"]; + available -> deleting[label="delete"]; + available -> snapshotting[label="create snapshot", color=gold4, fontcolor=gold4]; + available -> migrating[label="migrate"]; + available -> shrinking[label="shrink"]; + available -> extending[label="extend"]; + available -> unmanage_starting[label="unmanage"]; + available -> replication_change[label="add replica"]; + // Automatic transitions + creating -> available[label="success", color=darkgreen, fontcolor=darkgreen]; + deleting -> deleted[label="success", color=darkgreen, fontcolor=darkgreen]; + snapshotting -> available[label="success", color=darkgreen, fontcolor=darkgreen]; + manage_starting -> available[label="success", color=darkgreen, fontcolor=darkgreen]; + unmanage_starting -> unmanaged[label="success", color=darkgreen, fontcolor=darkgreen]; + extending -> available[label="success", color=darkgreen, fontcolor=darkgreen]; + shrinking -> available[label="success", color=darkgreen, fontcolor=darkgreen]; + replication_change -> available[label="success", color=darkgreen, fontcolor=darkgreen]; + // Reset transitions + error -> available[label="reset"]; + shrinking_error -> available[label="reset"]; + extending_error -> available[label="reset"]; + unmanage_error -> available[label="reset"]; + manage_error -> available[label="reset"]; + error_deleting -> available[label="reset"]; + // Error transitions + creating -> error[label="fail", color=red4, fontcolor=red4]; + migrating -> error[label="fail", color=red4, fontcolor=red4]; + shrinking -> shrinking_error[label="fail", color=red4, fontcolor=red4]; + shrinking -> shrinking_possible_data_loss_error[label="fail", color=red4, fontcolor=red4]; + extending -> extending_error[label="fail", color=red4, fontcolor=red4]; + unmanage_starting -> unmanage_error[label="fail", color=red4, fontcolor=red4]; + manage_starting -> manage_error[label="fail", color=red4, fontcolor=red4]; + snapshotting -> error[label="fail", color=red4, fontcolor=red4]; + deleting -> error_deleting[label="fail", color=red4, fontcolor=red4]; + } + +REST API impact +--------------- + +New states will be visible through any API that shows states. Also new +error conditions will become possible as we detect races earlier and +report them directly. + +The behavioral changes related to locking will not be microversioned, as it +won't be possible or desirable to emulate the old behavior once the changes +are implemented. However in cases where new states are added, those changes +will be microversioned so that clients which depend on the new states can +detect that the server supports them. + +Driver impact +------------- + +None + +Security impact +--------------- + +None + +Notifications impact +-------------------- + +None + +Other end user impact +--------------------- + +None + +Performance Impact +------------------ + +Distributed locking is expected to moderately slow down state changes. Also +adding more state changes will slow down operations that require them. + +Other deployer impact +--------------------- + +Requirement to deploy and configure suitable `Tooz`_ backend. Since Manila will +depend on tooz for correctness, tooz backends that fail to meet the API +contract won't be suitable. + +Developer impact +---------------- + +This will be significant. Developers will need to follow the new model for +all features that involve state changes. Also care will be needed with locks +to avoid deadlock situations. Holding locks for very limited time will help +avoid deadlocks but in case 2 locks are ever held at the same time, they need +to be deadlock-proofed by establishing a lock order. + +Implementation +============== + +Assignee(s) +----------- + bswartz + +Work Items +---------- + +* Add snapshotting state + +* Complete tooz integration + +* Wrap state changes with tooz locks + +Dependencies +============ + +Tooz + +Testing +======= + +Existing tests will help ensure no regressions, but to detect race +conditions we need rally tests or similarly high-concurrency tests. + +Documentation Impact +==================== + +Admin guide - need to document tooz requirements. + +Developer reference - need to document state machines and locking protocol + +References +========== + +_`Access rules spec`: https://review.openstack.org/#/c/399049/ + +_`Tooz` integration: https://review.openstack.org/#/c/318336/