Update Protection Plugin Design

Implements: blueprint protection-plugin-is-design
Change-Id: Iffd26f3a85346f7c29b93bc97d23d3472acb1f0f
2016-05-19 14:39:44 +03:00 · 2016-05-19 14:39:44 +03:00 · f35b04ed7c
parent 09162ed50c
commit f35b04ed7c
4 changed files with 508 additions and 65 deletions
--- a/doc/images/protection-service/activities-links.png
+++ b/doc/images/protection-service/activities-links.png
--- a/doc/source/specs/pluggable_protection_provider.rst
+++ b/doc/source/specs/pluggable_protection_provider.rst
@@ -4,51 +4,185 @@

 http://creativecommons.org/licenses/by/3.0/legalcode

+.. raw:: html
+
+    <style>
+        .red {color:#d32f2f; font-weight: bold;}
+        .green {color:#4caf50; font-weight: bold;}
+        .yellow {color:#fbc02d; font-weight: bold;}
+        .indigo {color:#536dfe; font-weight: bold;}
+    </style>
+
+.. role:: red
+.. role:: green
+.. role:: yellow
+.. role:: indigo
+
 ==========================================
 Pluggable Protection Provider
 ==========================================

-https://blueprints.launchpad.net/smaug/+spec/operation-engine-design
+https://blueprints.launchpad.net/smaug/+spec/protection-plugin-is-design

-Problem Description
+Protection Provider
 ===================

-Even though we allow each provider to be implemented in any way it pleases we
-foresee that most providers will want to be able share code between them.
-We would also like for a user to be able to easily extend the ProtectionProvider
-that will be provided by default.
+Protection Provider is a user-facing, configurable, pluggable entity that
+answers two questions, "how to" and "where to", by composing a bank store
+(responsible for the "where to") with different *Protection Plugins* (each
+responsible for the "how to"). The Protection Provider is configurable, both
+in its composition of bank and protection plugins and in their
+configuration.

-Proposed Change
-===============
+Internally, the protection provider contains a map between each registered
+*Protectable* (OpenStack resource type) and a corresponding *Protection
+Plugin*, which is used for operations on any resource of that type.

-As as solution we propose the *Pluggable Protection Provider*.
+There are 3 resource operations that a *Protection Provider* supports and that
+any *Protection Plugin* needs to implement. These operations usually act on
+numerous resources, and the *Protection Provider* infrastructure is responsible
+for using the corresponding *Protection Plugin* implementation for each
+resource. The *Protection Provider* is responsible for initiating a DFS
+traversal of the resource graph, building tasks for each of the resources, and
+linking them according to execution order and dependency.

-The *Pluggable Protection Provider* will be the reference implementation
-protection provider. It's purpose is to be fully pluggable and extandable so
-that only extream use cases will need to implement their own Protection Provider
-from scratch.
+#. **Protect**: the protection provider will traverse the selected resources
+   from the resource graph
+#. **Restore**: the protection provider will traverse the resource graph saved
+   in the checkpoint
+#. **Delete**: the protection provider will traverse the resource graph saved
+   in the checkpoint

-The protection provider will contain internally a map between any registered
-*Protectable* and a corrosponding *Protection Plugin*. When the pluggable
-protection provider is asked to perform an action, it will walk over the
-graph and pass a context object to the appropriate plugin whenever a node is
-encountered.
-
-The resource graph is traversed in with DFS. When a node is first encountered
-the protection manager gets the plugin for the appropriate resource type, builds
-a context and passes it to the plugins `get_pre_task()` method. The plugin can
-return any tasks that it wants added to the task list. When all of a node
-childrens have been visited the `get_pre_task()` is called. The task returned
-from this method will also be added to the task list but is also guranteed to
-execute after all the child node's tasks have finished. Any of the methods can
-return `None` if they don't want any action performed.
-
-After the entire grap has been traversed the Protection Provider will return
-the task lists which will be queued and than executed according to the
+After the entire graph has been traversed, the Protection Provider will return
+the task flow which will be queued and then executed according to the
 executor's policy. When all the tasks are done the operation is considered
 complete.

-This scheme decouples the tree structure form the task execution. A plugin that
+Protection Provider Configuration
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Protection Providers are loaded from configuration files, placed in the
+directory specified by the ``provider_config_dir`` configuration option (by
+default: ``/etc/smaug/providers.d``). Each provider configuration file must
+bear the ``.conf`` suffix and contain a ``[provider]`` section. This section
+specifies the following configuration:
+
+#. ``name``: the display name of the protection provider
+#. ``id``: unique identifier
+#. ``description``: textual description
+#. ``bank``: path to the bank plugin
+#. ``plugin``: path to a protection plugin. Should be specified multiple times
+   for multiple protection plugins. Every *Protectable* **must** have a
+   corresponding *Protection Plugin* to support it.
+
+Additionally, the provider configuration file can include other sections
+(besides the ``[provider]`` section), to be used as configuration for each bank
+or protection plugin.
+
+For example::
+
+  [provider]
+  name = Foo
+  id = 2e0c8826-81d6-44f5-bbe5-8f46a98c5845
+  description = Example Protection Provider
+  bank = smaug.protections.smaug-swift-bank-plugin
+  plugin = smaug.protections.smaug-volume-protection-plugin
+  plugin = smaug.protections.smaug-image-protection-plugin
+  plugin = smaug.protections.smaug-server-protection-plugin
+  plugin = smaug.protections.smaug-project-protection-plugin
+
+  [swift_client]
+  bank_swift_auth_url = http://10.0.0.10:5000
+  bank_swift_user = admin
+  bank_swift_key = password
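
The repeated ``plugin`` key in the example above is worth a note: Python's
standard ``configparser`` rejects duplicate options in its default strict mode.
The following is only a minimal sketch of one way to collect repeated keys
into lists. The function name ``parse_provider_conf`` and the returned layout
are assumptions made for illustration, not Smaug's actual loader.

```python
from collections import defaultdict


def parse_provider_conf(text):
    """Parse INI-style text into {section: {key: [values, ...]}}.

    Repeated keys (such as ``plugin``) accumulate instead of overwriting.
    Hypothetical helper for illustration only.
    """
    sections = defaultdict(lambda: defaultdict(list))
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1].strip()
            sections[current]  # register the section even if it stays empty
            continue
        if current is None or "=" not in line:
            raise ValueError("unexpected line: %r" % raw_line)
        key, value = line.split("=", 1)
        sections[current][key.strip()].append(value.strip())
    return {name: dict(options) for name, options in sections.items()}


EXAMPLE = """
[provider]
name = Foo
id = 2e0c8826-81d6-44f5-bbe5-8f46a98c5845
bank = smaug.protections.smaug-swift-bank-plugin
plugin = smaug.protections.smaug-volume-protection-plugin
plugin = smaug.protections.smaug-image-protection-plugin

[swift_client]
bank_swift_user = admin
"""

conf = parse_provider_conf(EXAMPLE)
provider = conf["provider"]
```

Here ``provider["plugin"]`` holds both plugin paths in file order, while
single-valued keys such as ``name`` hold a one-element list.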
+
+Protection Plugin
+=================
+
+A *Protection Plugin* is a component responsible for the implementation of
+operations (protect, restore, delete) of one or more *Protectables* (i.e.
+resource types). When writing a *Protection Plugin*, the following needs to be
+defined:
+
+#. Which resources does the protection plugin support
+#. What is the schema of parameters for each operation
+#. What is the schema of information the protection plugin stores in a
+   Checkpoint
+#. The implementation of each operation
+
+Protection Plugin Operation Activities
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A *Protection Plugin* defines how to protect, restore, and delete resources. In
+order to specify the detailed flow of each operation, a *Protection Plugin*
+needs to implement a number of 'hooks'. These hooks, named *Activities*, differ
+from one another by their time of execution with respect to other activities,
+either of the same resource or of other resources.
+
+#. **PreActivity**: invoked before any activity for this resource and dependent
+   resources has begun
+#. **ParallelActivity**: invoked after the resource *PreActivity* is complete,
+   regardless of the dependent resources' activities.
+#. **PostActivity**: invoked after all of the resource's activities are
+   complete, and the dependent resources' *PostActivities* are complete
+
+For example, a Protection Plugin for Nova servers might implement a protect
+operation by using *PreActivity* to contact a guest agent in order to complete
+database and operating system transactions, *ParallelActivity* to back up
+the server metadata, and *PostActivity* to contact the guest agent again in
+order to resume transactions.
+
+Practically, the protection plugin may implement methods in the form of::
+
+  activity_<operation_type>_<activity_type>
+
+Where:
+
+* ``operation_type`` is one of: ``protect``, ``restore``, ``delete``
+* ``activity_type`` is one of: ``pre``, ``post``, ``parallel``
+
+Notes:
+
+* Unimplemented methods are treated as no-ops
+* Each such method receives as parameters: ``checkpoint``, ``context``,
+  ``resource``, and ``parameters`` objects
+* These methods may return immediately, or use ``yield``. If ``yield`` is used,
+  the Protection Provider infrastructure is responsible for periodically
+  calling ``next()`` in order to "poll". This is extremely useful in
+  cases where asynchronous operations are initiated (such as Cinder volume
+  creation), but polling must be performed in order to decide when the
+  operation is complete, and whether it is successful or not. For example:
+
+::
+
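+  def activity_protect_parallel(self, checkpoint, context, resource, parameters):
+      # Start the asynchronous operation, then poll until it finishes.
+      operation_id = start_operation( ... )
+      while True:
+          status = get_status(operation_id)
+          if status == 'error':
+              raise Exception('operation failed')
+          elif status == 'success':
+              return
+          else:
+              # Not done yet: hand control back until the next poll.
+              yield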
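
The other side of this contract, the infrastructure that calls ``next()``, is
not shown in the spec. The sketch below illustrates one way such a polling
driver might look. ``run_activity`` and ``make_fake_activity`` are hypothetical
names, the activity is a plain function rather than a plugin method, and the
status values are canned so the example is self-contained.

```python
import inspect
import time


def run_activity(activity, *args, poll_interval=0.0):
    """Run an activity to completion and return how many times it yielded."""
    result = activity(*args)
    if not inspect.isgenerator(result):
        return 0  # the activity returned immediately, nothing to poll
    polls = 0
    while True:
        try:
            next(result)  # resume the activity until its next ``yield``
        except StopIteration:
            return polls  # the activity returned: operation succeeded
        polls += 1
        time.sleep(poll_interval)  # wait before polling again


def make_fake_activity(statuses):
    """Build an activity that reads its status from a canned sequence."""
    remaining = iter(statuses)

    def activity(checkpoint, context, resource, parameters):
        while True:
            status = next(remaining)
            if status == "error":
                raise RuntimeError("operation failed")
            elif status == "success":
                return
            else:
                yield  # still in progress, ask to be polled again

    return activity
```

An exception raised inside the activity propagates out of ``next()``, so the
driver sees a failed operation as an ordinary exception.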
+
+.. figure:: https://raw.githubusercontent.com/openstack/smaug/master/doc/images/protection-service/activities-links.png
+    :alt: Activities Links
+    :align: center
+
+    Activities Links
+
+    :green:`Green`: link of the parent resource PreActivity to the child
+    resource PreActivity
+
+    :yellow:`Yellow`: link of the resource PreActivity to ParallelActivity
+
+    :red:`Red`: link of the resource ParallelActivity to PostActivity
+
+    :indigo:`Indigo`: link of the child resource PostActivity to the parent
+    resource PostActivity
+
+This scheme decouples the tree structure from the task execution. A plugin that
 handles multiple resources or that aggregates multiple resources to one task can
 use this mechanism to only return tasks when appropriate for its scheme.
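
The four kinds of links in the Activities Links figure can be produced by a
single DFS over the resource graph. The following is a rough sketch under
stated assumptions: the graph is a plain dict mapping each resource to its
child (dependent) resources, a task is a ``(resource, activity_type)`` tuple,
and ``build_activity_links`` is a hypothetical name, not part of Smaug.

```python
def build_activity_links(graph, roots):
    """Return a set of (before, after) links between activity tasks.

    ``graph`` maps each resource to the list of its child resources.
    """
    links = set()
    visited = set()

    def visit(resource):
        if resource in visited:
            return
        visited.add(resource)
        # Links within a single resource.
        links.add(((resource, "pre"), (resource, "parallel")))   # yellow
        links.add(((resource, "parallel"), (resource, "post")))  # red
        for child in graph.get(resource, ()):
            # Links between a parent and each of its children.
            links.add(((resource, "pre"), (child, "pre")))    # green
            links.add(((child, "post"), (resource, "post")))  # indigo
            visit(child)

    for root in roots:
        visit(root)
    return links


# A server that depends on one volume.
example_graph = {"server": ["volume"], "volume": []}
example_links = build_activity_links(example_graph, ["server"])
```

Parent-to-child links are added before the ``visited`` check on the child, so
a resource shared by several parents is linked to each of them while its own
activities are created only once.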

--- a/doc/source/specs/protection-service/activities-links.svg
+++ b/doc/source/specs/protection-service/activities-links.svg
--- a/doc/source/specs/protection-service/protection-service.rst
+++ b/doc/source/specs/protection-service/protection-service.rst
@@ -10,24 +10,34 @@ Protection Service Basics

 https://bugs.launchpad.net/smaug/+bug/1529199

-Protection Service is a component of smaug (an openstack project working as a service for data protection), which is responsible to execute protect/restore/other actions on operations (triggered plans).
+Protection Service is a component of Smaug (an OpenStack project working as a
+service for data protection), which is responsible for executing
+protect/restore/other actions on operations (triggered plans).

-Architecturally, it acts as a RPC server role for smaug API service to actually execute the actions on triggered operations.
+Architecturally, it acts as an RPC server for the Smaug API service, and
+actually executes the actions of triggered operations.

-It's also the role who actually cooperates with protection plugins provided by providers.  It will load providers (composed by a series of plugins) and thus manage them.
+It is also the component that actually cooperates with the protection plugins
+supplied by providers. It loads providers (each composed of a series of
+plugins) and manages them.

-Internally, protection service will construct work flow for each operation action execution, where tasks in work flow will be linked to a graph by resource dependency and thus be executed on parallel or linearly according to the graph task flow.
+Internally, the protection service constructs a work flow for each operation
+action execution. Tasks in the work flow are linked into a graph by
+resource dependency, and are thus executed in parallel or sequentially
+according to the graph task flow.

 RPC interfaces
 ================================================

 .. image:: https://raw.githubusercontent.com/openstack/smaug/master/doc/images/protection-service/protection-architecture.png

-From the module graph, protection service basically provide following RPC calls:
+As the module graph shows, the protection service provides the following RPC
+calls:

 Operation RPC:
 --------------------
-**execute_operation(backup_plan:Bac,upPlan, action:Action):** where action could be protect or restore
+**execute_operation(backup_plan:BackupPlan, action:Action):** where action
+could be protect or restore

 Provider RPC:
 -------------
@@ -51,65 +61,90 @@ Main Concept

 Protection Manager
 ------------------
-Endpoint of the RPC server, which will handle Operation RPC calls and dispatch other RPC calls to corresponding components.
+Endpoint of the RPC server, which handles Operation RPC calls and dispatches
+other RPC calls to the corresponding components.

-It will produce a graph work flow for each operation execution, and have the work flow to be executed through its work flow engine.
+It produces a graph work flow for each operation execution, and has the
+work flow executed by its work flow engine.

 ProviderRegistry
 ----------------

-Entity to manage multiple providers, which will load provider definitions on init from config files and maintain them in memory map.
+Entity that manages multiple providers. It loads provider definitions from
+config files on init and maintains them in an in-memory map.

-It will actually handle RPC related to provider management, like list_providers() or show_provider().
+It handles the RPCs related to provider management, such as
+list_providers() or show_provider().

 CheckpointCollection
 --------------------

-Entity to manage checkpoints, which provides CRUD interfaces to handle checkpoint.  As checkpoint is a smaug internal entity, one checkpoint operation is actually composed by combination of serveral BankPlugin atomic operations.
+Entity that manages checkpoints, providing CRUD interfaces for handling
+checkpoints. As a checkpoint is a Smaug internal entity, one checkpoint
+operation is actually composed of several BankPlugin atomic operations.

-Take create_checkpoint as example, it will first acquire write lease (there will be detailed **lease** design doc) to avoid conflict with GC deletion, then it needs create key/value for checkpoint itself. After that, it will build multiple indexes for easier list checkpoints.
+Take create_checkpoint as an example: it first acquires a write lease (there
+will be a detailed **lease** design doc) to avoid conflicts with GC deletion,
+then creates the key/value for the checkpoint itself. After that, it builds
+multiple indexes to make listing checkpoints easier.

 Typical scenario
 ======================================
-A typical scenario will start from a triggered operation being sent through RPC call to Protection Service.
+A typical scenario starts with a triggered operation being sent through an RPC
+call to the Protection Service.

-Let's take action protect as the example and analyze the sequence together with the class graph:
+Let's take the protect action as the example and analyze the sequence together
+with the class graph:

 .. image:: https://raw.githubusercontent.com/openstack/smaug/master/doc/images/protection-service/protect-rpc-call-seq-diagram.png

 1. Smaug **Operation Engine**
 ------------------------------
-who is responsible for triggering operation according to time schedule or events, will call RPC call of Protection Service: execute_operation(backup_plan:Bac,upPlan, action:Action);
+which is responsible for triggering operations according to a time schedule or
+events, calls the Protection Service RPC:
+execute_operation(backup_plan:BackupPlan, action:Action);

 2. ProtectionManager
 ------------------------
-who plays as one of the RPC server endpoints, and will handle this RPC call by following sequence:
+which acts as one of the RPC server endpoints, handles this RPC call in the
+following sequence:

 2.1 CreateCheckpointTask:
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+This task is the starting task of the graph flow. It calls
+the unique instance of class
+**Checkpoints**:create_checkpoint(plan:ProtectionPlan), to create one
+checkpoint that persists the status of the action execution.

-This task will be the start point task of the graph flow. This task will call the unique instance of class **Checkpoints**:create_checkpoint(plan:ProtectionPlan), to create one checkpoint to persist the status of the action execution.
+The instance of **Checkpoints** retrieves the **Provider** from the input
+parameter **BackupPlan**, and gets the unique instance of **BankPlugin**.

-The instance of **Checkpoints** will retrieve the **Provider** from input parameter **BackupPlan**, and get the unique instance of **BankPlugin**.
+While **BankPlugin** provides interfaces for CRUD on key/values in the
+**Bank** and lease interfaces to avoid write/delete conflicts, **Checkpoints**
+is responsible for the whole procedure of creating a checkpoint, including
+granting the lease, creating the key/value of the checkpoint, building
+indexes, etc., by composing calls to **BankPlugin**.

-While **BankPlugin** provides interfaces for CRUD key/values in **Bank** and lease interfaces to avoid write/delete conflict, **Checkpoints** is responsible for the whole procedure of create checkpoint, including grant lease, create key/value of checkpoint, build indexes etc. through composing calls to **BankPlugin**
+2.2 Call ProtectionProvider to build the resource flow
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+This task is built by walking through the **resource tree** (see the
+**Pluggable protection provider** doc), which returns a graph flow.
+The resulting graph flow is composed of tasks representing the activities of
+the ProtectionPlugin for each resource, and the links between the tasks
+according to the activity types and resource dependencies.

-2.2 call ProtectionProvider to build sub task flow:
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The graph flow returned by ProtectionProvider is added to the top layer
+task flow, right after the starting task **CreateCheckpointTask**, and is
+executed with a parallel engine.

-This task is built by walking through **resource tree** (see **Pluggable protection provider** doc), which will return a graph flow. The result graph flow could be composed by single task or multiple tasks built with dependencies.
+The protection plugin is responsible for storing the ProtectionData (backup
+id, snapshot id, image id, etc.) into the Bank under the corresponding
+**ProtectionDefinition**.

-The graph flow returned by ProtectionProvider would be added to the top layer task flow, right behind the start point task **CreateCheckpointTask**, and will be executed with parallel engine. 
-
-When it comes to each resource task returned from ProtectionProvider task flow building, each task will call protect() interface of related ProtectionPlugin.  There, we will get ProtectionData as the return result, which describes the restore target (where the resource is protected to) and the id of the protection data (backup id, snapshot id, image id etc., anything).  This ProtectionData will be persisted into Bank under the corresponding **ProtectionDefinition**.
-
-2.3 SyncCheckpointStatusTask:
+2.3 CompleteCheckpointTask
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-This task is added into the top layer task flow right after the task flow built form ProtectProvider, which will be executed only when all tasks/flows ahead of it have been executed successfully.
-
-This task will list all **ProtectionDefinition** under one checkpoint, for each ProtectionDefinition: if its ProtectionData status hasn't turned to be available, this task will check its protection_id status (backup, snapshot, replication status) by calling ProtectionPlugin.get_protection_status(). If any ProtectionData turns to be available, its status will be updated to the corresponding ProtectionDefinition and won't be checked next time.
-
-Since each protect action will take some time to achieve finished status (ProtectionData turns to be available), this task could be executed periodically or only executed once before timeout.
-
-Until the operation timeout, this task will get the final status of this checkpoint: if all protect actions have achieved finished status, then the checkpoint is finished; otherwise, the checkpoint is broken and will be abandoned.
+This task is added to the top layer task flow right after the task flow built
+from ProtectionProvider, and is executed only when all tasks ahead of it
+have completed successfully. This task updates the checkpoint status
+to available, and commits it to the bank.