Manual cleaning

Ironic already provides support for automated cleaning. This specification describes support for manual cleaning, including an API for operators to specify a list of clean steps to perform on a node from the MANAGEABLE state. Clean steps that are destructive and long running, such as configuring RAID or doing burn-in, are good candidates for manual cleaning instead of automated cleaning. This feature was formerly called 'zapping'.

Change-Id: Iea975cfc2effc2d8be186294b88d85d8f2ace7b2
blueprint: manual-cleaning
2015-10-06 02:22:09 +00:00 · 2015-10-06 02:22:09 +00:00 · 0ea28bec96
parent 529d9f7631
commit 0ea28bec96
3 changed files with 440 additions and 304 deletions
--- a/specs/approved/implement-zapping-states.rst
+++ b/specs/approved/implement-zapping-states.rst
@@ -1,291 +0,0 @@
-..
- This work is licensed under a Creative Commons Attribution 3.0 Unported
- License.
-
- http://creativecommons.org/licenses/by/3.0/legalcode
-
-==========================================
-Implement Zapping States
-==========================================
-
-https://blueprints.launchpad.net/ironic/+spec/implement-zapping-states
-
-Zapping encompasses all long running, destructive tasks an operator may
-want to take either between workloads, or before the first workload has been
-assigned to a node.
-
-
-Problem description
-===================
-
-* Operators need some long running work done on nodes before they can be
-  successfully provisioned.
-
-* Things like firmware updates, setting up new RAID levels, or burning in
-  nodes often need to be done before a user is given a server, but take
-  too long to reasonably do at deploy time.
-
-* Operators may want certain clean steps to only run on demand, rather than
-  every clean cycle. One example is a burn in test before nodes are made
-  AVAILABLE. By making clean_steps a subset of all possible zap steps,
-  operators can choose which steps will be run on every clean cycle, and
-  which will only be initiated by the operator.
-
-* Many of these tasks will provide useful scheduling hints to Nova once
-  hardware capabilities are introduced. Operators
-  could use these scheduling hints to create flavors, such as a nova compute
-  flavor that requires a node with RAID 1 for extra durability.
-
-Proposed change
-===============
-
-* Modify the provision state API call which will allow a node in MANAGEABLE
-  state to go to a ZAPPING state and perform a list of specified ZAPPING steps.
-  These will be provided to the API as a list of dictionaries encoded as JSON.
-
-* Add zapping steps to drivers, using the @clean_step decorator with a default
-  cleaning_priority of 0. This will ensure the step isn't run as part of the
-  automated cleaning between DELETED and AVAILABLE that happens in CLEANING.
-
-* The list of possible ZAPPING steps will be pulled from the list of functions
-  decorated with @clean_step, which is documented in [1].
-
-* Operators will be able to get a list of possible steps by querying
-  /nodes/<node_ident>/cleaning/all_steps. This will provide a superset of the
-  states listed in /nodes/<node_ident>/cleaning/clean_steps, which doesn't list
-  clean_steps with a cleaning_priority of 0.
-
-* When the conductor attempts to execute a zap step, it will call
-  execute_clean_step() on the driver responsible for that zap step.
-
-Alternatives
------------
-
-* We could make zap steps and clean steps mutually exclusive, simplifying
-  some of the API and possible confusion, but limiting zapping and requiring
-  a second, nearly identical API for executing individual CLEANING states or
-  duplicating cleaning steps as zap and clean steps. Nearly any step that
-  can be executed on demand via ZAPPING can be argued to be a necessary step
-  in CLEANING to provide a consistent platform. For example, if you use
-  ZAPPING to set up a RAID 10 on the node, you may want to ensure a clean
-  RAID 10 is presented to every client, and therefore would need to check
-  and possibly rebuild the RAID 10 in CLEANING. The same can be said for
-  firmware upgrade (tenants can change firmwares), etc.
-
-Data model impact
-----------------
-
-None
-
-REST API impact
---------------
-
-GET /nodes/<node_ident>/cleaning/all_steps
-
-* An API endpoint should be added to allow operators to see available
-  zapping steps. This will be similar to
-  /nodes/<node_ident>/cleaning/clean_steps, but will return all cleaning and
-  zapping steps, with the format as follows::
-
-    [{
-      // 'interface' is one of : 'power', 'management', 'deploy'
-      // 'step' is an opaque identifier used by the driver. Could be a driver
-      // function name, could be some function in the agent.
-      // 'cleaning_priority' is priority the step would be run at in cleaning.
-      'interface': 'interface',
-      'step': 'step',
-      'cleaning_priority': some_integer,
-      // a list of required arguments as strings that must be included in
-      // the PUT to the node's provision state API to move to ZAPPING
-      'required_args': []
-    },
-    ... more steps ...
-    ]
-
-
-* An example with a single step::
-
-    [{
-      'interface': 'management',
-      'step': 'configure_hardware_raid',
-      'required_args': ['raid_level']
-      'cleaning_priority': 0,
-    }]
-
-
-
-* If the driver interface can not synchronously get the list of clean steps
-  (eg, because a remote agent is used to determine available cleaning steps),
-  then the driver MUST cache the list of clean steps from the most recent
-  execution of said agent and return that. In the absence of such data, the
-  driver MAY raise an error, which should be translated by the API service into
-  an HTTP RETRY with an indication to the client as to when to retry using a
-  Retry-After HTTP header. If the driver interface can synchronously return the
-  cleaning steps, without relying on the hardware or a remote agent, it SHOULD
-  do so, though it MAY also rely on the aforementioned caching mechanism.
-
-PUT /v1/nodes/<node_ident>/states/provision
-
-* The API will allow users to put a node directly into zapping
-  provision_state with a PUT from MANAGEABLE state,
-  the same as how provision state is changed anywhere else in Ironic. On top
-  of the normal 'target_state': 'zap' , the PUT will require an argument
-  'zap_steps', which will be a list in the form::
-
-    'zap_steps': [{
-        'interface': 'management'
-        'step': 'configure_hardware_raid',
-        'raid_level': 10 // required kwarg
-        ... // more required kwargs (if applicable)
-      },
-      {
-        'interface': 'deploy'
-        'step': 'erase_devices'
-      }
-    }]
-
-
-  Only 'interface' and 'step' are required for all steps. Each step may
-  require additional kwargs, as noted above. The steps will be executed in the
-  order provided. If any step is missing a kwarg or has incorrect kwargs, the
-  node will go to ZAPFAIL with an appropriate error message.
-
-* In the above example, hardware RAID 10 would be configured by the management
-  driver, then all devices would be erased (in that order).
-
-* The API will be changed to prevent changing power state or provision state
-  while the node is in a ZAPPING state. A node in ZAPFAIL
-  state may have its power state changed via the API, because the operator will
-  likely need to restart the node to fix it.
-
-State Machine Impact
--------------------
-
-Implement/add the following parts of the state machine:
-
-* MANAGEABLE -> ZAPPING (zap)
-
-* ZAPPING -> MANAGEABLE (done)
-
-* ZAPPING -> ZAPFAIL (fail)
-
-* add ZAPFAIL -> ZAPPING (zap)
-
-* add ZAPFAIL -> MANAGEABLE (manage)
-
-Add 'zap' to states.VERBS.
-
-Client (CLI) impact
-------------------
-
-* Add an argument to the node-set-provision-state CLI called
-  '--zap-steps' that takes a single argument: a JSON file to read and pass to
-  the API, which has the same format as what is passed to the API for zapping.
-  If the input file is specified as '-', the CLI will read in from stdin, to
-  allow piping in the zap steps. Using '-' to signify stdin is common in Unix
-  utilities. '--zap-steps' will on be required if the requested provision state
-  is "zap", otherwise, it not allowed.
-
-RPC API impact
--------------
-
-Add do_node_clean to the RPC API, remove cleaning from the
-do_provisioning_action RPC API call, and use this same call for zapping.
-This should provide the cleanest API.
-
-Driver API impact
-----------------
-
-None
-
-Nova driver impact
------------------
-
-states.py should be synced to the Nova driver, so Nova is aware of zap* states.
-
-Security impact
---------------
-
-None
-
-Other end user impact
---------------------
-
-None
-
-Scalability impact
------------------
-
-None
-
-Performance Impact
------------------
-
-None
-
-Other deployer impact
---------------------
-
-None
-
-Developer impact
----------------
-
-None
-
-Implementation
-==============
-
-Assignee(s)
-----------
-
-Primary assignee:
-  JoshNang
-
-Work Items
----------
-
-* Add API checks for zap states and allow "zap" as a
-  provision target action, which will trigger the manageable -> zapping
-  transition or zapfail -> zapping transition.
-
-* Bump API microversion to add zapping states and "zap" verb.
-
-* Modify the cleaning flow to allow zapping
-
-* Change execute_clean_steps and get_clean_steps in any asynchronous driver
-  to cache clean/zap steps and return cached clean/zap steps whenever possible.
-
-* Allow APIs to return a Retry-After HTTP header and empty response, in
-  response to a certain exception from drivers.
-
-Dependencies
-============
-
-* get_clean_steps API https://review.openstack.org/#/c/159322
-
-
-Testing
-=======
-
-* Drivers implementing zapping will be expected to test their added
-  features.
-
-
-Upgrades and Backwards Compatibility
-====================================
-
-None
-
-Documentation Impact
-====================
-
-The overlap between cleaning and zapping should be clearly defined.
-
-
-References
-==========
-
-1: https://review.openstack.org/#/c/102685/
-
-2: https://review.openstack.org/#/c/150073/
--- a/specs/approved/manual-cleaning.rst
+++ b/specs/approved/manual-cleaning.rst
@@ -0,0 +1,440 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+===============
+Manual cleaning
+===============
+
+https://blueprints.launchpad.net/ironic/+spec/manual-cleaning
+
+Manual cleaning (as opposed to automated cleaning) encompasses all long
+running, manual, destructive tasks an operator may want to perform either
+between workloads, or before the first workload has been assigned to a node.
+
+This feature had previously been called `"Zapping"
+<https://review.openstack.org/#/c/185122/>`_ and this specification copies
+a lot of the zapping specification. (Thank you Josh Gachnang!)
+
+
+Problem description
+===================
+
+`Automated cleaning <http://specs.openstack.org/openstack/ironic-specs/specs/kilo-implemented/implement-cleaning-states.html>`_
+has been available in ironic since the kilo cycle. It lets operators
+choose which clean steps are automatically done prior to the first
+time a node is deployed and each time after a node is released.
+
+However, operators may want certain operations or tasks to only run on demand,
+rather than in every clean cycle. Things like firmware updates, setting up new
+RAID levels, or burning in nodes often need to be done before a user is given
+a server, but take too long to reasonably do at deploy time.
+
+Many of the above tasks could provide useful scheduling hints to nova once
+hardware capabilities are introduced. Operators could use these scheduling
+hints to create flavors, such as a nova compute flavor that requires a node
+with RAID 1 for extra durability.
+
+
+Proposed change
+===============
+
+Instead of adding new ZAP* states to the state machine to distinguish between
+manual and automated cleaning, the existing CLEAN* states and cleaning
+mechanism will be reused for both automated and manual cleaning.
+The main differences will be:
+
+* manual cleaning can only be initiated when a node is in the MANAGEABLE state.
+  Once the manual cleaning is finished, the node will be put in the
+  MANAGEABLE state again.
+
+* operators will be able to initiate a manual clean via the modified API
+  to set the node's provision state. Details are described in the
+  :ref:`ProvisionCleanAPI` section.
+
+* A manual clean step might need some arguments to be specified. (This might
+  be useful for future automated steps too.) To support this, the
+  ironic.drivers.base.clean_step decorator will be modified to accept a list
+  of arguments. (Default is None.) Each argument is a dictionary with:
+
+  * 'name': <name of argument>
+  * 'description': <description>. This should include possible values.
+  * 'required': Boolean. True if this argument is required -- it must be
+    specified in the manual clean request; false if it is optional.
+
+* add clean steps to drivers that will only be used by manual cleaning. The
+  mechanism for doing this exists already. Driver implementors only need to
+  use the @clean_step decorator with a default cleaning priority of 0. This
+  will ensure the step isn't run as part of the automated cleaning. The
+  implementor can specify whether the step is abortable, and should also
+  include any arguments that can be passed to the clean step.
+
+* operators will be able to get a list of possible steps via an API. The
+  :ref:`CleanStepsAPI` section provides more information.
+
+* similar to executing automated clean steps, when the conductor attempts to
+  execute a manual clean step, it will call execute_clean_step() on the driver
+  responsible for that clean step.
+
+* to avoid confusion, the 'clean_nodes' config will be renamed to
+  'automated_clean_enable' since it only pertains to automated cleaning.
+  The deprecation and deletion of the 'clean_nodes' config will follow
+  ironic's normal deprecation process.
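The decorator change described in the list above can be sketched as follows. This is a minimal illustration, not ironic's implementation: the local `clean_step` function is a stand-in for `ironic.drivers.base.clean_step`, and the parameter name `argsinfo`, the `_clean_step_*` attribute names, and the `FakeRAIDInterface` class are all hypothetical.

```python
import functools


def clean_step(priority, abortable=False, argsinfo=None):
    """Hypothetical stand-in for ironic.drivers.base.clean_step."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)
        # Metadata that a conductor could read when listing or validating steps.
        wrapper._is_clean_step = True
        wrapper._clean_step_priority = priority
        wrapper._clean_step_abortable = abortable
        wrapper._clean_step_argsinfo = argsinfo or []
        return wrapper
    return decorator


class FakeRAIDInterface:
    """Illustrative driver interface; not a real ironic class."""

    # Priority 0 keeps the step out of automated cleaning, so it can only
    # be requested through manual cleaning.
    @clean_step(priority=0, abortable=False, argsinfo=[
        {'name': 'create_nonroot_volumes',
         'description': 'True (the default) creates non-root volumes; '
                        'False skips them.',
         'required': False},
    ])
    def create_configuration(self, task, create_nonroot_volumes=True):
        # A real step would talk to hardware; this one only echoes its input.
        return {'create_nonroot_volumes': create_nonroot_volumes}


step = FakeRAIDInterface.create_configuration
```

Storing the argument descriptions on the function itself is one possible design; it would let the same metadata feed both the steps-listing API and request validation.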
+
+Alternatives
+------------
+
+* We could make manual clean steps and automated clean steps mutually
+  exclusive with separate APIs and terminology and mechanisms to use, but
+  conceptually, since they are all clean steps it is less confusing to
+  provide a similar mechanism for both.
+
+* We could have called 'manual clean' something else like 'zap' to avoid
+  having to distinguish between 'manual' and 'automated' cleaning. However,
+  describing the differences between 'zap' and 'clean' seems more confusing,
+  and that confusion and complexity became apparent when trying to implement
+  it that way.
+
+
+Data model impact
+-----------------
+
+None.
+
+
+State Machine Impact
+--------------------
+
+This:
+
+* removes all mention of 'zap' and the ZAP* states from the `proposed
+  state machine <http://specs.openstack.org/openstack/ironic-specs/specs/kilo-implemented/new-ironic-state-machine.html>`_
+
+* adds two new transitions:
+
+  * MANAGEABLE -> CLEANING via 'clean' verb, to start manual cleaning
+  * CLEANING -> MANAGEABLE via 'manage' verb, to end a successful manual clean
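As a rough illustration of the two added transitions, the sketch below models them as a lookup table. The lowercase state strings and the `next_state` helper are assumptions made for this example; they are not taken from ironic's state machine code.

```python
# (current state, verb) -> next state.
# Only the two transitions added by this spec are listed.
MANUAL_CLEAN_TRANSITIONS = {
    ('manageable', 'clean'): 'cleaning',
    ('cleaning', 'manage'): 'manageable',
}


def next_state(current, verb):
    """Return the state reached from `current` via `verb`, or raise ValueError."""
    try:
        return MANUAL_CLEAN_TRANSITIONS[(current, verb)]
    except KeyError:
        raise ValueError(
            "verb %r is not allowed from state %r" % (verb, current))
```

In this toy table, requesting 'clean' from any state other than manageable is rejected, which mirrors the rule that manual cleaning can only start from MANAGEABLE.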
+
+
+REST API impact
+---------------
+
+.. _ProvisionCleanAPI:
+
+PUT /v1/nodes/<node_ident>/states/provision
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This API will allow users to put a node directly into CLEANING
+provision state from MANAGEABLE state via 'target': 'clean'.
+The PUT will also require the argument 'clean_steps' to be specified. This
+is an ordered list of clean steps, with a clean step being represented as a
+dictionary encoded as JSON.
+
+As an example::
+
+  'clean_steps': [{
+      'interface': 'raid',
+      'step': 'create_configuration',
+      'args': {'create_nonroot_volumes': False, // optional keyword argument
+               ... }               // more keyword arguments (if applicable)
+    },
+    {
+      'interface': 'deploy',
+      'step': 'erase_devices'
+    }
+  ]
+
+In the above example, the driver's RAID interface would configure hardware
+RAID without non-root volumes, and then all devices would be erased
+(in that order).
+
+A clean step is represented by a dictionary (JSON), in the form::
+
+  {
+      'interface': <interface>,
+      'step': <name of clean step>,
+      'args': {<arg1>: <value1>, ..., <argn>: <valuen>}
+  }
+
+The 'interface' and 'step' keys are required for all steps. If a step
+takes additional keyword arguments, the 'args' key may be specified. It
+is a dictionary of keyword arguments, with each keyword-argument entry being
+<name>: <value>.
+
+If any step is missing a required keyword argument, no manual cleaning will be
+performed and the node will be put in CLEANFAIL provision state with an
+appropriate error message.
+
+If, during the cleaning process, a clean step determines that it has incorrect
+keyword arguments, all earlier steps will be performed and then the node will
+be put in CLEANFAIL provision state with an appropriate error message.
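The up-front check for required keys and required keyword arguments could look roughly like the sketch below. The function name `validate_clean_steps`, the sample data, and the hypothetical `update_firmware` step are assumptions for illustration. The sketch only covers validation before cleaning starts, not errors a step discovers while running.

```python
def validate_clean_steps(requested, available):
    """Return a list of error strings; an empty list means the request is OK."""
    errors = []
    known = {(s['interface'], s['step']): s for s in available}
    for index, step in enumerate(requested):
        missing = [k for k in ('interface', 'step') if k not in step]
        if missing:
            errors.append("step %d: missing key(s): %s"
                          % (index, ', '.join(missing)))
            continue
        spec = known.get((step['interface'], step['step']))
        if spec is None:
            errors.append("step %d: unknown step %s.%s"
                          % (index, step['interface'], step['step']))
            continue
        given = step.get('args', {})
        for arg in spec.get('args', []):
            if arg['required'] and arg['name'] not in given:
                errors.append("step %d: missing required argument %r"
                              % (index, arg['name']))
    return errors


# Sample data; 'update_firmware' and its 'image_url' argument are hypothetical.
AVAILABLE = [
    {'interface': 'raid', 'step': 'create_configuration',
     'args': [{'name': 'create_nonroot_volumes',
               'description': 'Create non-root volumes.',
               'required': False}]},
    {'interface': 'deploy', 'step': 'erase_devices', 'args': []},
    {'interface': 'management', 'step': 'update_firmware',
     'args': [{'name': 'image_url',
               'description': 'Location of the firmware image.',
               'required': True}]},
]

GOOD_REQUEST = [
    {'interface': 'raid', 'step': 'create_configuration',
     'args': {'create_nonroot_volumes': False}},
    {'interface': 'deploy', 'step': 'erase_devices'},
]
BAD_REQUEST = [{'interface': 'management', 'step': 'update_firmware'}]
```

Collecting every error rather than stopping at the first one is a choice made here so that an operator could fix the whole request in one pass.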
+
+A new API version is needed to support this.
+
+.. _CleanStepsAPI:
+
+GET /nodes/<node_ident>/cleaning/steps
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We had planned on having an API endpoint to allow operators to see the
+clean steps for an automated cleaning. That proposed API had been
+GET /nodes/<node_ident>/cleaning/clean_steps, but it hasn't been
+implemented yet.
+
+With the introduction of manual cleaning, instead of
+GET /nodes/<node_ident>/cleaning/clean_steps, this proposes replacing that
+with the API endpoint GET /nodes/<node_ident>/cleaning/steps. By default, it
+will return all available clean steps (with priorities of zero and non-zero),
+for both manual and automated cleaning.
+
+An optional field 'min_priority' can be specified to filter for clean
+steps with priorities equal to or above the specified minimum value.
+For example, to only get clean steps for automated cleaning (not manual)::
+
+    GET http://127.0.0.1:6385/v1/nodes/my-awesome-node/cleaning/steps?min_priority=1
+
+The response to this request would be a list of clean steps sorted in
+decreasing priorities, formatted as follows::
+
+  [{
+    // 'interface': is one of 'power', 'management', 'deploy', 'raid'.
+    // 'step': is an opaque identifier used by the driver. Could be a driver
+    //         function name or some function in the agent.
+    // 'priority': is the priority used for determining when to execute
+    //             the step; larger values have higher priority.
+    // 'abortable': True if cleaning can be aborted during execution of this
+    //              step; False otherwise.
+    'interface': 'interface',
+    'step': 'step',
+    'priority': Integer,
+    'abortable': Boolean,
+
+    // 'args': a list of keyword arguments that may be included in the
+    //         'PUT /v1/nodes/NNNN/states/provision' request when doing
+    //         a manual clean. An argument is a dictionary with:
+    //           - 'name': <name of argument>
+    //           - 'description': <description>
+    //           - 'required': Boolean. True if required; false if optional
+    'args': []
+   },
+   ... more steps ...
+  ]
+
+An example with a single step::
+
+  [{
+    'interface': 'raid',
+    'step': 'create_configuration',
+    'args': [{'name':'create_root_volume',
+              'description':'Set to True (the default) to create root volume
+                             specified in the node's target_raid_config. False
+                             prevents the root volume from being created.',
+              'required':False},
+             {'name':'create_nonroot_volumes',
+              'description':'Set to True (the default) to create non-root
+                             volumes that may be specified in the node's
+                             target_raid_config. False prevents non-root
+                             volumes from being created.',
+              'required':False}],
+    'priority': 0,
+    'abortable': True
+  }]
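The 'min_priority' filtering and the descending sort of the response can be sketched in a few lines. The helper name `select_clean_steps` and the sample priorities are assumptions for illustration only.

```python
def select_clean_steps(steps, min_priority=0):
    """Keep steps at or above min_priority, highest priority first."""
    selected = [s for s in steps if s['priority'] >= min_priority]
    return sorted(selected, key=lambda s: s['priority'], reverse=True)


# Sample data with made-up priorities.
STEPS = [
    {'interface': 'raid', 'step': 'create_configuration', 'priority': 0},
    {'interface': 'deploy', 'step': 'erase_devices', 'priority': 10},
    {'interface': 'management', 'step': 'reset_bios', 'priority': 20},
]

all_steps = select_clean_steps(STEPS)
automated_only = select_clean_steps(STEPS, min_priority=1)
```

With the default of 0 every step is returned, while `min_priority=1` drops the priority-0 steps that are only usable for manual cleaning.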
+
+If the driver interface cannot synchronously get the list of clean steps,
+for example, because a remote agent is used to determine available clean
+steps, then the driver MUST cache the list of clean steps from the most
+recent execution of said agent and return that. In the absence of such data,
+the driver MAY raise an error, which should be translated by the API service
+into:
+
+  * an HTTP 202
+
+  * a new (we created this) HTTP header 'Retry-Request-After', indicating
+    to the client how long in seconds the client should wait to retry. A '-1'
+    indicates that it is unknown how long to wait. This might happen for
+    example when the request is made when a node is in ENROLL state. At this
+    point it is unknown when the remote agent will be available on the node
+    for querying.
+
+  * a body with a message indicating that the data are not available yet.
+
+If the driver interface can synchronously return the clean steps without
+relying on the hardware or a remote agent, it SHOULD do so, though it
+MAY also rely on the aforementioned caching mechanism.
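A client could interpret the 202 response and the proposed 'Retry-Request-After' header along these lines. The function name, the returned tuples, and the assumption that a successful listing uses HTTP 200 are illustrative and not part of the specification.

```python
def interpret_steps_response(status, headers, body):
    """Classify a clean-steps response as ('ok', data) or ('retry', seconds).

    For 'retry', seconds is None when the server reported -1, meaning it
    does not know how long the client should wait.
    """
    if status == 200:  # assumed success code; the spec does not state it
        return ('ok', body)
    if status == 202:
        # A missing header is treated like -1 (unknown wait time).
        seconds = int(headers.get('Retry-Request-After', '-1'))
        return ('retry', None if seconds < 0 else seconds)
    raise ValueError("unexpected HTTP status: %d" % status)
```

Mapping -1 to `None` keeps callers from accidentally sleeping for a negative duration.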
+
+A new API version is needed to support this.
+
+
+Client (CLI) impact
+-------------------
+
+ironic node-set-provision-state
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A new argument called 'clean-steps' will be added to the
+node-set-provision-state CLI. Its value is a JSON file which is read and the
+contents passed to the API. Thus, the file has the same format as what is
+passed to the API for clean steps.
+
+If the input file is specified as '-', the CLI will read in from stdin, to
+allow piping in the clean steps. Using '-' to signify stdin is common in Unix
+utilities.
+
+The 'clean-steps' argument is required if the requested provision state
+target/verb is "clean". Otherwise, specifying it is considered an error.
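Reading the clean steps from a file, or from stdin when the argument is '-', might be handled as in the sketch below. The helper name `load_clean_steps` and its `stdin` parameter are assumptions for illustration. Note that the input must be strict JSON (double-quoted strings, no comments), unlike the annotated examples shown earlier in this spec.

```python
import io
import json
import sys


def load_clean_steps(source, stdin=None):
    """Load clean steps from a JSON file path, or from stdin if source is '-'."""
    if source == '-':
        stream = stdin if stdin is not None else sys.stdin
        return json.load(stream)
    with open(source) as f:
        return json.load(f)


# Simulate piping the steps in on stdin.
piped = io.StringIO('[{"interface": "deploy", "step": "erase_devices"}]')
steps = load_clean_steps('-', stdin=piped)
```

Passing the stream as a parameter is a choice made here to keep the helper easy to exercise without touching the real stdin.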
+
+ironic node-get-clean-steps
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A new node-get-clean-steps command will be added as follows::
+
+    ironic node-get-clean-steps [--min-priority <priority>] <node>
+
+    <node>: name or UUID of the node
+    --min-priority <priority>: optional minimum priority; default is 0 for all clean steps
+
+If successful, it will return a list of clean steps. If the response from the
+corresponding REST API request is an HTTP 202, it will return the message from
+that response body (that the data are not available) along with a suggestion to
+retry the request later.
+
+
+RPC API impact
+--------------
+
+Add do_node_clean() (as a call()) to the RPC API and bump the RPC API version.
+
+
+Driver API impact
+-----------------
+
+None
+
+
+Nova driver impact
+------------------
+
+None
+
+
+Security impact
+---------------
+
+None
+
+
+Other end user impact
+---------------------
+
+None
+
+
+Scalability impact
+------------------
+
+None
+
+
+Performance Impact
+------------------
+
+None
+
+
+Other deployer impact
+---------------------
+
+None
+
+
+Developer impact
+----------------
+
+None
+
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  rloo (taking over from JoshNang who has left ironic)
+
+Other contributors:
+  JoshNang (who started this)
+
+
+Work Items
+----------
+
+* Make the changes (as described above) to the state machine
+
+* Bump API microversion to allow manual cleaning and implement the changes
+  to PUT /v1/nodes/(node_ident)/states/provision API (as described above)
+
+* Modify the cleaning flow to allow manual cleaning
+
+* Change execute_clean_steps and get_clean_steps in any asynchronous driver
+  to cache clean steps and return cached clean steps whenever possible.
+
+* Allow APIs to return an HTTP 202 with a Retry-Request-After HTTP header and
+  a message body, in response to a certain exception from drivers.
+
+
+Dependencies
+============
+
+* get_clean_steps API: https://review.openstack.org/#/c/159322
+
+
+Testing
+=======
+
+* Drivers implementing manual cleaning will be expected to test their added
+  features.
+
+
+Upgrades and Backwards Compatibility
+====================================
+
+None
+
+
+Documentation Impact
+====================
+
+The documentation will be updated to describe or clarify automated cleaning and
+manual cleaning and how to configure ironic to do one or both of them:
+
+ * http://docs.openstack.org/developer/ironic/deploy/install-guide.html
+
+ * http://docs.openstack.org/developer/ironic/deploy/cleaning.html
+
+ * http://docs.openstack.org/developer/ironic/webapi/v1.html will be
+   updated to reflect the API version that supports manual cleaning
+
+
+References
+==========
+
+Automated cleaning specification: http://specs.openstack.org/openstack/ironic-specs/specs/kilo-implemented/implement-cleaning-states.html
+
+State machine specification: http://specs.openstack.org/openstack/ironic-specs/specs/kilo-implemented/new-ironic-state-machine.html
+
+Zapping related patches:
+
+*  Launchpad blueprint: https://blueprints.launchpad.net/ironic/+spec/implement-zapping-states
+
+* specification patches:
+    * https://review.openstack.org/#/c/185122/
+    * https://review.openstack.org/#/c/209207/
+
+* code patches:
+    * https://review.openstack.org/#/c/221949/
+    * https://review.openstack.org/#/c/221989/
+    * https://review.openstack.org/#/c/223295/
+    * https://review.openstack.org/#/c/223311/
--- a/specs/liberty/implement-zapping-states.rst
+++ b/specs/liberty/implement-zapping-states.rst
@@ -1,13 +0,0 @@
-..
- This work is licensed under a Creative Commons Attribution 3.0 Unported
- License.
-
- http://creativecommons.org/licenses/by/3.0/legalcode
-
-========================
-Implement Zapping States
-========================
-
-This spec was proposed in the Liberty cycle.
-
-See :doc:`../approved/implement-zapping-states`.