Manual cleaning

Ironic already provides support for automated cleaning. This
specification describes support for manual cleaning, including an
API for operators to specify a list of clean steps to perform on a
node from the MANAGEABLE state.

Clean steps that are destructive and long-running, such as
configuring RAID or doing burn-in, are good candidates for
manual cleaning instead of automated cleaning.

This feature was formerly called 'zapping'.

Change-Id: Iea975cfc2effc2d8be186294b88d85d8f2ace7b2
blueprint: manual-cleaning
Ruby Loo 2015-10-06 02:22:09 +00:00
parent 529d9f7631
commit 0ea28bec96
3 changed files with 440 additions and 304 deletions


@ -1,291 +0,0 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
==========================================
Implement Zapping States
==========================================
https://blueprints.launchpad.net/ironic/+spec/implement-zapping-states
Zapping encompasses all long running, destructive tasks an operator may
want to take either between workloads, or before the first workload has been
assigned to a node.
Problem description
===================
* Operators need some long running work done on nodes before they can be
successfully provisioned.
* Things like firmware updates, setting up new RAID levels, or burning in
nodes often need to be done before a user is given a server, but take
too long to reasonably do at deploy time.
* Operators may want certain clean steps to only run on demand, rather than
every clean cycle. One example is a burn in test before nodes are made
AVAILABLE. By making clean_steps a subset of all possible zap steps,
operators can choose which steps will be run on every clean cycle, and
which will only be initiated by the operator.
* Many of these tasks will provide useful scheduling hints to Nova once
hardware capabilities are introduced. Operators
could use these scheduling hints to create flavors, such as a nova compute
flavor that requires a node with RAID 1 for extra durability.
Proposed change
===============
* Modify the provision state API call which will allow a node in MANAGEABLE
state to go to a ZAPPING state and perform a list of specified ZAPPING steps.
These will be provided to the API as a list of dictionaries encoded as JSON.
* Add zapping steps to drivers, using the @clean_step decorator with a default
cleaning_priority of 0. This will ensure the step isn't run as part of the
automated cleaning between DELETED and AVAILABLE that happens in CLEANING.
* The list of possible ZAPPING steps will be pulled from the list of functions
decorated with @clean_step, which is documented in [1].
* Operators will be able to get a list of possible steps by querying
/nodes/<node_ident>/cleaning/all_steps. This will provide a superset of the
states listed in /nodes/<node_ident>/cleaning/clean_steps, which doesn't list
clean_steps with a cleaning_priority of 0.
* When the conductor attempts to execute a zap step, it will call
execute_clean_step() on the driver responsible for that zap step.
Alternatives
------------
* We could make zap steps and clean steps mutually exclusive, simplifying
some of the API and possible confusion, but limiting zapping and requiring
a second, nearly identical API for executing individual CLEANING states or
duplicating cleaning steps as zap and clean steps. Nearly any step that
can be executed on demand via ZAPPING can be argued to be a necessary step
in CLEANING to provide a consistent platform. For example, if you use
ZAPPING to set up a RAID 10 on the node, you may want to ensure a clean
RAID 10 is presented to every client, and therefore would need to check
and possibly rebuild the RAID 10 in CLEANING. The same can be said for
firmware upgrade (tenants can change firmwares), etc.
Data model impact
-----------------
None
REST API impact
---------------
GET /nodes/<node_ident>/cleaning/all_steps
* An API endpoint should be added to allow operators to see available
zapping steps. This will be similar to
/nodes/<node_ident>/cleaning/clean_steps, but will return all cleaning and
zapping steps, with the format as follows::
[{
// 'interface' is one of : 'power', 'management', 'deploy'
// 'step' is an opaque identifier used by the driver. Could be a driver
// function name, could be some function in the agent.
// 'cleaning_priority' is priority the step would be run at in cleaning.
'interface': 'interface',
'step': 'step',
'cleaning_priority': some_integer,
// a list of required arguments as strings that must be included in
// the PUT to the node's provision state API to move to ZAPPING
'required_args': []
},
... more steps ...
]
* An example with a single step::
[{
'interface': 'management',
'step': 'configure_hardware_raid',
'required_args': ['raid_level'],
'cleaning_priority': 0
}]
* If the driver interface can not synchronously get the list of clean steps
(eg, because a remote agent is used to determine available cleaning steps),
then the driver MUST cache the list of clean steps from the most recent
execution of said agent and return that. In the absence of such data, the
driver MAY raise an error, which should be translated by the API service into
an HTTP RETRY with an indication to the client as to when to retry using a
Retry-After HTTP header. If the driver interface can synchronously return the
cleaning steps, without relying on the hardware or a remote agent, it SHOULD
do so, though it MAY also rely on the aforementioned caching mechanism.
PUT /v1/nodes/<node_ident>/states/provision
* The API will allow users to put a node directly into zapping
provision_state with a PUT from MANAGEABLE state,
the same as how provision state is changed anywhere else in Ironic. On top
of the normal 'target_state': 'zap' , the PUT will require an argument
'zap_steps', which will be a list in the form::
'zap_steps': [{
'interface': 'management',
'step': 'configure_hardware_raid',
'raid_level': 10, // required kwarg
... // more required kwargs (if applicable)
},
{
'interface': 'deploy',
'step': 'erase_devices'
}
]
Only 'interface' and 'step' are required for all steps. Each step may
require additional kwargs, as noted above. The steps will be executed in the
order provided. If any step is missing a kwarg or has incorrect kwargs, the
node will go to ZAPFAIL with an appropriate error message.
* In the above example, hardware RAID 10 would be configured by the management
driver, then all devices would be erased (in that order).
* The API will be changed to prevent changing power state or provision state
while the node is in a ZAPPING state. A node in ZAPFAIL
state may have its power state changed via the API, because the operator will
likely need to restart the node to fix it.
State Machine Impact
--------------------
Implement/add the following parts of the state machine:
* MANAGEABLE -> ZAPPING (zap)
* ZAPPING -> MANAGEABLE (done)
* ZAPPING -> ZAPFAIL (fail)
* add ZAPFAIL -> ZAPPING (zap)
* add ZAPFAIL -> MANAGEABLE (manage)
Add 'zap' to states.VERBS.
Client (CLI) impact
-------------------
* Add an argument to the node-set-provision-state CLI called
'--zap-steps' that takes a single argument: a JSON file to read and pass to
the API, which has the same format as what is passed to the API for zapping.
If the input file is specified as '-', the CLI will read in from stdin, to
allow piping in the zap steps. Using '-' to signify stdin is common in Unix
utilities. '--zap-steps' will only be required if the requested provision state
is "zap"; otherwise, it is not allowed.
RPC API impact
--------------
Add do_node_clean to the RPC API, remove cleaning from the
do_provisioning_action RPC API call, and use this same call for zapping.
This should provide the cleanest API.
Driver API impact
-----------------
None
Nova driver impact
------------------
states.py should be synced to the Nova driver, so Nova is aware of zap* states.
Security impact
---------------
None
Other end user impact
---------------------
None
Scalability impact
------------------
None
Performance Impact
------------------
None
Other deployer impact
---------------------
None
Developer impact
----------------
None
Implementation
==============
Assignee(s)
-----------
Primary assignee:
JoshNang
Work Items
----------
* Add API checks for zap states and allow "zap" as a
provision target action, which will trigger the manageable -> zapping
transition or zapfail -> zapping transition.
* Bump API microversion to add zapping states and "zap" verb.
* Modify the cleaning flow to allow zapping
* Change execute_clean_steps and get_clean_steps in any asynchronous driver
to cache clean/zap steps and return cached clean/zap steps whenever possible.
* Allow APIs to return a Retry-After HTTP header and empty response, in
response to a certain exception from drivers.
Dependencies
============
* get_clean_steps API https://review.openstack.org/#/c/159322
Testing
=======
* Drivers implementing zapping will be expected to test their added
features.
Upgrades and Backwards Compatibility
====================================
None
Documentation Impact
====================
The overlap between cleaning and zapping should be clearly defined.
References
==========
1: https://review.openstack.org/#/c/102685/
2: https://review.openstack.org/#/c/150073/


@ -0,0 +1,440 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
===============
Manual cleaning
===============
https://blueprints.launchpad.net/ironic/+spec/manual-cleaning
Manual cleaning (as opposed to automated cleaning) encompasses all long
running, manual, destructive tasks an operator may want to perform either
between workloads, or before the first workload has been assigned to a node.
This feature had previously been called `"Zapping"
<https://review.openstack.org/#/c/185122/>`_ and this specification copies
a lot of the zapping specification. (Thank you Josh Gachnang!)
Problem description
===================
`Automated cleaning <http://specs.openstack.org/openstack/ironic-specs/specs/kilo-implemented/implement-cleaning-states.html>`_
has been available in ironic since the kilo cycle. It lets operators
choose which clean steps are automatically done prior to the first
time a node is deployed and each time after a node is released.
However, operators may want certain operations or tasks to only run on demand,
rather than in every clean cycle. Things like firmware updates, setting up new
RAID levels, or burning in nodes often need to be done before a user is given
a server, but take too long to reasonably do at deploy time.
Many of the above tasks could provide useful scheduling hints to nova once
hardware capabilities are introduced. Operators could use these scheduling
hints to create flavors, such as a nova compute flavor that requires a node
with RAID 1 for extra durability.
Proposed change
===============
Instead of adding new ZAP* states to the state machine to distinguish between
manual and automated cleaning, the existing CLEAN* states and cleaning
mechanism will be reused for both automated and manual cleaning.
The main differences will be:
* manual cleaning can only be initiated when a node is in the MANAGEABLE state.
Once the manual cleaning is finished, the node will be put in the
MANAGEABLE state again.
* operators will be able to initiate a manual clean via the modified API
to set the node's provision state. Details are described in the
:ref:`ProvisionCleanAPI` section.
* A manual clean step might need some arguments to be specified. (This might
be useful for future automated steps too.) To support this, the
ironic.drivers.base.clean_step decorator will be modified to accept a list
of arguments. (Default is None.) Each argument is a dictionary with:
* 'name': <name of argument>
* 'description': <description>. This should include possible values.
* 'required': Boolean. True if this argument is required -- it must be
specified in the manual clean request; false if it is optional.
* add clean steps to drivers that will only be used by manual cleaning. The
mechanism for doing this exists already. Driver implementors only need to
use the @clean_step decorator with a default cleaning priority of 0. This
will ensure the step isn't run as part of the automated cleaning. The
implementor can specify whether the step is abortable, and should also
include any arguments that can be passed to the clean step.
* operators will be able to get a list of possible steps via an API. The
:ref:`CleanStepsAPI` section provides more information.
* similar to executing automated clean steps, when the conductor attempts to
execute a manual clean step, it will call execute_clean_step() on the driver
responsible for that clean step.
* to avoid confusion, the 'clean_nodes' config will be renamed to
'automated_clean_enable' since it only pertains to automated cleaning.
The deprecation and deletion of the 'clean_nodes' config will follow
ironic's normal deprecation process.
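
The argument metadata described above can be sketched as follows. This is only
an illustration under stated assumptions: the decorator below is a stand-in
written for this example, not the actual ironic.drivers.base.clean_step, and
the parameter name ``argsinfo``, the attribute names and the ``FakeRAID`` class
are hypothetical.

```python
import functools

REQUIRED_ARG_KEYS = {'name', 'description', 'required'}


def clean_step(priority, abortable=False, argsinfo=None):
    """Hypothetical stand-in for ironic.drivers.base.clean_step.

    argsinfo is a list of dictionaries, each with the keys 'name',
    'description' and 'required', as described in this specification.
    """
    for arg in argsinfo or []:
        missing = REQUIRED_ARG_KEYS - set(arg)
        if missing:
            raise ValueError('argument entry lacks keys: %s' % sorted(missing))

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)
        # Metadata a conductor could inspect when listing clean steps.
        wrapper._is_clean_step = True
        wrapper._clean_step_priority = priority
        wrapper._clean_step_abortable = abortable
        wrapper._clean_step_argsinfo = argsinfo
        return wrapper
    return decorator


class FakeRAID(object):
    """Hypothetical driver interface used only for this sketch."""

    # Priority 0 keeps the step out of automated cleaning.
    @clean_step(priority=0, abortable=True, argsinfo=[
        {'name': 'create_nonroot_volumes',
         'description': 'True (the default) creates non-root volumes.',
         'required': False}])
    def create_configuration(self, task, create_nonroot_volumes=True):
        return {'create_nonroot_volumes': create_nonroot_volumes}
```

A priority of 0 together with the argument list is what marks the step as
manual-only while still advertising which keyword arguments it accepts.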
Alternatives
------------
* We could make manual clean steps and automated clean steps mutually
exclusive with separate APIs and terminology and mechanisms to use, but
conceptually, since they are all clean steps it is less confusing to
provide a similar mechanism for both.
* We could have called 'manual clean' something else like 'zap' to avoid
having to distinguish between 'manual' and 'automated' cleaning, but
it seems more confusing to describe the differences between 'zap' and 'clean'
and that confusion and complexity is apparent when trying to implement it
that way.
Data model impact
-----------------
None.
State Machine Impact
--------------------
This:
* removes all mention of 'zap' and the ZAP* states from the `proposed
state machine <http://specs.openstack.org/openstack/ironic-specs/specs/kilo-implemented/new-ironic-state-machine.html>`_
* adds two new transitions:
* MANAGEABLE -> CLEANING via 'clean' verb, to start manual cleaning
* CLEANING -> MANAGEABLE via 'manage' verb, to end a successful manual clean
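
The two new transitions can be pictured as entries in a lookup table. The
sketch below is not ironic's state machine code; the lower-case state strings
and the names ``MANUAL_CLEAN_TRANSITIONS`` and ``next_state`` are assumptions
made for illustration.

```python
# (current state, verb) -> next state. Only the two transitions added by
# this specification are listed; the real state machine has many more.
MANUAL_CLEAN_TRANSITIONS = {
    ('manageable', 'clean'): 'cleaning',
    ('cleaning', 'manage'): 'manageable',
}


def next_state(current, verb):
    """Return the state reached from 'current' via 'verb'."""
    try:
        return MANUAL_CLEAN_TRANSITIONS[(current, verb)]
    except KeyError:
        raise ValueError(
            'verb %r is not allowed from state %r' % (verb, current))
```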
REST API impact
---------------
.. _ProvisionCleanAPI:
PUT /v1/nodes/<node_ident>/states/provision
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This API will allow users to put a node directly into CLEANING
provision state from MANAGEABLE state via 'target': 'clean'.
The PUT will also require the argument 'clean_steps' to be specified. This
is an ordered list of clean steps, with a clean step being represented as a
dictionary encoded as JSON.
As an example::

    'clean_steps': [{
        'interface': 'raid',
        'step': 'create_configuration',
        'args': {'create_nonroot_volumes': False,  // optional keyword argument
                 ... }  // more keyword arguments (if applicable)
        },
        {
        'interface': 'deploy',
        'step': 'erase_devices'
        }
    ]

In the above example, the driver's RAID interface would configure hardware
RAID without non-root volumes, and then all devices would be erased
(in that order).
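
A client could assemble the body of this PUT roughly as shown below. This is a
minimal sketch: the helper name ``build_clean_request`` is hypothetical, and
only the 'target' and 'clean_steps' keys come from this specification.

```python
import json


def build_clean_request(clean_steps):
    """Serialize the body for a manual clean request.

    clean_steps is an ordered list; a JSON array preserves that order,
    which matters because the steps are executed in the order given.
    """
    return json.dumps({'target': 'clean', 'clean_steps': clean_steps})


body = build_clean_request([
    {'interface': 'raid',
     'step': 'create_configuration',
     'args': {'create_nonroot_volumes': False}},
    {'interface': 'deploy',
     'step': 'erase_devices'},
])
```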
A clean step is represented by a dictionary (JSON), in the form::
{
'interface': <interface>,
'step': <name of clean step>,
'args': {<arg1>: <value1>, ..., <argn>: <valuen>}
}
The 'interface' and 'step' keys are required for all steps. If a step
takes additional keyword arguments, the 'args' key may be specified. It
is a dictionary of keyword arguments, with each keyword-argument entry being
<name>: <value>.
If any step is missing a required keyword argument, no manual cleaning will be
performed and the node will be put in CLEANFAIL provision state with an
appropriate error message.
If, during the cleaning process, a clean step determines that it has incorrect
keyword arguments, all earlier steps will be performed and then the node will
be put in CLEANFAIL provision state with an appropriate error message.
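
The up-front check for required keys and required keyword arguments could look
like the sketch below. It is an illustration only: the function name
``validate_clean_steps`` is hypothetical, and rejecting a step that is not in
the list of available steps is an assumption that this specification does not
spell out.

```python
def validate_clean_steps(requested, available):
    """Return a list of error strings; an empty list means no problems.

    requested: step dictionaries as sent in 'clean_steps'.
    available: step dictionaries whose 'args' entry is a list of
               {'name', 'description', 'required'} dictionaries.
    """
    known = {(s['interface'], s['step']): s for s in available}
    errors = []
    for index, step in enumerate(requested):
        missing_keys = [k for k in ('interface', 'step') if k not in step]
        if missing_keys:
            errors.append('step %d: missing key(s) %s' % (index, missing_keys))
            continue
        info = known.get((step['interface'], step['step']))
        if info is None:
            # Assumption: unknown steps are rejected before cleaning starts.
            errors.append('step %d: unknown step %s.%s'
                          % (index, step['interface'], step['step']))
            continue
        given = step.get('args', {})
        for arg in info.get('args', []):
            if arg['required'] and arg['name'] not in given:
                errors.append('step %d: missing required argument %r'
                              % (index, arg['name']))
    return errors
```

If the returned list is non-empty, no step would be executed and the node
would go to CLEANFAIL with the collected messages.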
A new API version is needed to support this.
.. _CleanStepsAPI:
GET /nodes/<node_ident>/cleaning/steps
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We had planned on having an API endpoint to allow operators to see the
clean steps for an automated cleaning. That proposed API had been
GET /nodes/<node_ident>/cleaning/clean_steps, but it hasn't been
implemented yet.
With the introduction of manual cleaning, this specification proposes
replacing GET /nodes/<node_ident>/cleaning/clean_steps with the API endpoint
GET /nodes/<node_ident>/cleaning/steps. By default, it will return all
available clean steps (with priorities of zero and non-zero), for both manual
and automated cleaning.
An optional query parameter 'min_priority' can be specified to filter for clean
steps with priorities equal to or above the specified minimum value.
For example, to only get clean steps for automated cleaning (not manual)::
GET http://127.0.0.1:6385/v1/nodes/my-awesome-node/cleaning/steps?min_priority=1
The response to this request would be a list of clean steps sorted in
decreasing order of priority, formatted as follows::

    [{
      // 'interface': is one of 'power', 'management', 'deploy', 'raid'.
      // 'step': is an opaque identifier used by the driver. Could be a driver
      //         function name or some function in the agent.
      // 'priority': is the priority used for determining when to execute
      //         the step; larger values have higher priority.
      // 'abortable': True if cleaning can be aborted during execution of this
      //         step; False otherwise.
      'interface': 'interface',
      'step': 'step',
      'priority': Integer,
      'abortable': Boolean,
      // 'args': a list of keyword arguments that may be included in the
      //         'PUT /v1/nodes/NNNN/states/provision' request when doing
      //         a manual clean. An argument is a dictionary with:
      //         - 'name': <name of argument>
      //         - 'description': <description>
      //         - 'required': Boolean. True if required; false if optional
      'args': []
     },
     ... more steps ...
    ]

An example with a single step::

    [{
      'interface': 'raid',
      'step': 'create_configuration',
      'args': [{'name':'create_root_volume',
                'description':'Set to True (the default) to create root volume
                               specified in the node's target_raid_config. False
                               prevents the root volume from being created.',
                'required':False},
               {'name':'create_nonroot_volumes',
                'description':'Set to True (the default) to create non-root
                               volumes that may be specified in the node's
                               target_raid_config. False prevents non-root
                               volumes from being created.',
                'required':False}],
      'priority': 0,
      'abortable': True
    }]

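
The 'min_priority' filtering and the ordering of the response can be sketched
as below. The function name ``filter_steps`` is hypothetical and the step
dictionaries are reduced to the keys needed for the example.

```python
def filter_steps(steps, min_priority=0):
    """Keep steps with priority >= min_priority, highest priority first."""
    selected = [s for s in steps if s['priority'] >= min_priority]
    return sorted(selected, key=lambda s: s['priority'], reverse=True)


steps = [
    {'interface': 'raid', 'step': 'create_configuration', 'priority': 0},
    {'interface': 'deploy', 'step': 'erase_devices', 'priority': 10},
    {'interface': 'management', 'step': 'example_step', 'priority': 20},
]

# min_priority=1 drops the manual-only step (priority 0).
automated_only = filter_steps(steps, min_priority=1)
```

With the default of 0, every step is returned, which matches the behaviour
described for a request without 'min_priority'.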
If the driver interface cannot synchronously get the list of clean steps,
for example, because a remote agent is used to determine available clean
steps, then the driver MUST cache the list of clean steps from the most
recent execution of said agent and return that. In the absence of such data,
the driver MAY raise an error, which should be translated by the API service
into:
* an HTTP 202
* a new HTTP header 'Retry-Request-After' (non-standard; introduced by this
specification), indicating how long, in seconds, the client should wait
before retrying. A value of '-1' indicates that it is unknown how long to
wait. This might happen, for example, when the request is made while a node
is in the ENROLL state. At this point it is unknown when the remote agent
will be available on the node for querying.
* a body with a message indicating that the data are not available yet.
If the driver interface can synchronously return the clean steps without
relying on the hardware or a remote agent, it SHOULD do so, though it
MAY also rely on the aforementioned caching mechanism.
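
On the client side, the 202 response and the 'Retry-Request-After' header
might be interpreted as in the sketch below. The function name
``interpret_steps_response`` is hypothetical, and treating HTTP 200 as the
success status is an assumption; only the 202 status, the header name and the
'-1' convention come from this specification.

```python
def interpret_steps_response(status, headers, body):
    """Classify a response from the clean steps listing.

    Returns ('ok', steps) on success, ('retry', seconds) when the data
    are not available yet, or ('retry', None) when the wait is unknown.
    """
    if status == 200:  # assumption: 200 is the success status
        return ('ok', body)
    if status == 202:
        wait = int(headers.get('Retry-Request-After', '-1'))
        return ('retry', None if wait < 0 else wait)
    raise RuntimeError('unexpected HTTP status %d' % status)
```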
A new API version is needed to support this.
Client (CLI) impact
-------------------
ironic node-set-provision-state
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A new argument called 'clean-steps' will be added to the
node-set-provision-state CLI. Its value is a JSON file which is read and the
contents passed to the API. Thus, the file has the same format as what is
passed to the API for clean steps.
If the input file is specified as '-', the CLI will read in from stdin, to
allow piping in the clean steps. Using '-' to signify stdin is common in Unix
utilities.
The 'clean-steps' argument is required if the requested provision state
target/verb is "clean". Otherwise, specifying it is considered an error.
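
The handling of the 'clean-steps' argument could be sketched as follows. This
is not the python-ironicclient implementation; the helper names
``load_clean_steps`` and ``check_clean_steps_arg`` are hypothetical, and the
optional ``stdin`` parameter exists only to make the sketch easy to exercise.

```python
import io
import json
import sys


def load_clean_steps(path, stdin=None):
    """Read clean steps from a JSON file, or from stdin when path is '-'."""
    if path == '-':
        stream = stdin if stdin is not None else sys.stdin
        return json.load(stream)
    with open(path) as f:
        return json.load(f)


def check_clean_steps_arg(target, clean_steps_path):
    """Enforce that clean steps accompany 'clean' and nothing else."""
    if target == 'clean' and clean_steps_path is None:
        raise ValueError("clean steps are required when the target is 'clean'")
    if target != 'clean' and clean_steps_path is not None:
        raise ValueError("clean steps are only valid when the target is 'clean'")


# Simulate piping the steps in through stdin.
piped = io.StringIO('[{"interface": "deploy", "step": "erase_devices"}]')
steps_from_stdin = load_clean_steps('-', stdin=piped)
```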
ironic node-get-clean-steps
~~~~~~~~~~~~~~~~~~~~~~~~~~~
A new node-get-clean-steps command will be added as follows::
ironic node-get-clean-steps [--min-priority <priority>] <node>
<node>: name or UUID of the node
--min-priority <priority>: optional minimum priority; default is 0 for all clean steps
If successful, it will return a list of clean steps. If the response from the
corresponding REST API request is an HTTP 202, it will return the message from
that response body (that the data are not available) along with a suggestion to
retry the request later.
RPC API impact
--------------
Add do_node_clean() (as a call()) to the RPC API and bump the RPC API version.
Driver API impact
-----------------
None
Nova driver impact
------------------
None
Security impact
---------------
None
Other end user impact
---------------------
None
Scalability impact
------------------
None
Performance Impact
------------------
None
Other deployer impact
---------------------
None
Developer impact
----------------
None
Implementation
==============
Assignee(s)
-----------
Primary assignee:
rloo (taking over from JoshNang who has left ironic)
Other contributors:
JoshNang (who started this)
Work Items
----------
* Make the changes (as described above) to the state machine
* Bump API microversion to allow manual cleaning and implement the changes
to PUT /v1/nodes/(node_ident)/states/provision API (as described above)
* Modify the cleaning flow to allow manual cleaning
* Change execute_clean_steps and get_clean_steps in any asynchronous driver
to cache clean steps and return cached clean steps whenever possible.
* Allow APIs to return a Retry-Request-After HTTP header and empty response, in
response to a certain exception from drivers.
Dependencies
============
* get_clean_steps API: https://review.openstack.org/#/c/159322
Testing
=======
* Drivers implementing manual cleaning will be expected to test their added
features.
Upgrades and Backwards Compatibility
====================================
None
Documentation Impact
====================
The documentation will be updated to describe or clarify automated cleaning and
manual cleaning and how to configure ironic to do one or both of them:
* http://docs.openstack.org/developer/ironic/deploy/install-guide.html
* http://docs.openstack.org/developer/ironic/deploy/cleaning.html
* http://docs.openstack.org/developer/ironic/webapi/v1.html will be
updated to reflect the API version that supports manual cleaning
References
==========
Automated cleaning specification: http://specs.openstack.org/openstack/ironic-specs/specs/kilo-implemented/implement-cleaning-states.html
State machine specification: http://specs.openstack.org/openstack/ironic-specs/specs/kilo-implemented/new-ironic-state-machine.html
Zapping related patches:
* Launchpad blueprint: https://blueprints.launchpad.net/ironic/+spec/implement-zapping-states
* specification patches:
* https://review.openstack.org/#/c/185122/
* https://review.openstack.org/#/c/209207/
* code patches:
* https://review.openstack.org/#/c/221949/
* https://review.openstack.org/#/c/221989/
* https://review.openstack.org/#/c/223295/
* https://review.openstack.org/#/c/223311/


@ -1,13 +0,0 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
========================
Implement Zapping States
========================
This spec was proposed in the Liberty cycle.
See :doc:`../approved/implement-zapping-states`.