Use node.fault field for power fault recovery work

This updates the spec to use a node.fault field as opposed to
a node.maintenance_type field, to keep track of any fault
that is detected by ironic. (ironic will still put the node
in maintenance mode upon detection of a fault.)

A few other changes were made to the spec to clarify things.

Story: #1596107
Change-Id: I675c46e068654019073f5cb3bf4c290291ba405f
This commit is contained in:
Ruby Loo 2018-05-15 11:46:53 -04:00
parent 39ce029c8f
commit d9cf8fefe6
1 changed files with 72 additions and 44 deletions

View File

@ -8,18 +8,17 @@
Power fault recovery
====================
https://bugs.launchpad.net/ironic/+bug/1596107
This spec proposes adding a new node field to differentiate different
maintenance types. And, if possible, recover node from maintenance
state if the node is in maintenance due to power sync failure, and power
sync is succeed.
https://storyboard.openstack.org/#!/story/1596107
This spec proposes adding a new node field to reflect any faults detected
by the system. For nodes put into maintenance mode due to power-sync
failures, this proposes a mechanims to try to recover the nodes from
the failure.
Problem description
===================
Currently, ironic's API and data model makes no differentiation between ironic
Currently, ironic's API and data model do not differentiate between ironic
setting maintenance on a node (for instance, when cleaning fails, or when the
power status loop cannot contact the BMC) and an operator setting maintenance
on a node (an explicit API action).
@ -29,47 +28,67 @@ This situation makes it difficult to apply any power fault recovery mechanism.
Proposed change
===============
A string field named ``maintenance_type`` will be added to ironic data model,
it will be used to record different maintenance types.
A string field named ``fault`` will be added to ironic's data model for a Node.
It will be used to record any detected faults.
The field will be set to following values in different cases:
The field will be set to one of the following values, depending on the
situation (when ironic puts the node into maintenance due to one of these):
* ``manual``: when a node is explicitly put into maintenance from ironic API.
* ``power failure``: when a node is entering maintenance due to power sync
failure exceeded max retries.
* ``clean failure``: when a node is entering maintenance due to failure of a
* ``power failure``: when a node is put into maintenance due to power sync
failures that exceed max retries.
* ``clean failure``: when a node is put into maintenance due to failure of a
cleaning operation.
* ``rescue abort failure``: when a node is entering maintenance due to failure
* ``rescue abort failure``: when a node is put into maintenance due to failure
of cleaning up during rescue abort.
* None: there is no fault detected for the node; this is the default value.
Since the field is set to a fault when ironic puts the node into maintenance
due to a fault, it gets set to None when the node is taken out of
maintenance (by ironic or the user).
The field will be set to None if not in maintenance.
For nodes that were in maintenance prior to upgrading to this release,
the field will also be None, since we don't know for sure, if (or what)
there had been a fault. Even if there had been, we don't know whether the
operator wants the node in maintenance for other reasons as well.
Add a periodic task to conductor to filter out nodes in maintenance and
``maintenance_type`` is ``power failure``, trying to get power status, and
bring nodes out of maintenance once the power status is retrieved
successfully.
A periodic task will be added to the conductor. This will operate on nodes
with ``node.fault == power failure``. It will try to get the node's power
state. If successful, the node will be taken out of maintenance.
Add a new integer option named ``[conductor]power_failure_recovery_interval``
to config how many seconds between each run of periodic task. Set to 0 will
disable power fault recovery, the default value is 300 (5 minutes).
An integer configuration option named
``[conductor]power_failure_recovery_interval`` will be added. This is the
interval (number of seconds) between each run of the periodic task. The
default value is 300 seconds (5 minutes). Setting it to 0 will disable
power fault recovery.
Alternatives
------------
Proposals have already been made and rejected to use the combination of
Proposals have already been made and rejected to use a combination of
maintenance being set and a given maintenance_reason or last_error to do
self-healing due to inconsistency.
Specific faults support [#]_ is proposed but it's too big and doesn't reach
a consensus.
Specific faults support [#]_ is proposed but it's too big and to-date, no
consensus has been reached on it.
An earlier version of this spec had proposed a ``maintenance_type``
field for the node, instead of a ``fault`` field. We prefer the
``fault`` field since it better reflects that it is about faults.
``maintenance_type`` caused the field to be tied in with the
existing ``maintenance`` and ``maintenance_reason`` fields.
Using ``fault`` makes it distinct from ``maintenance``; it will
make it easier to move forward in the future, should we decide to
provide more enhancements that are fault-related, or to separate
the notion and handling of faults from maintenance (which some
would argue should only be operator-initiated).
Data model impact
-----------------
A string field named ``maintenance_type`` will be added to ironic node table.
A string field named ``fault`` will be added to ironic's node table.
The field will be added to ``Node`` object, object version will be increased.
The field will be added to ``Node`` object; object version will be
incremented.
State Machine Impact
--------------------
@ -79,10 +98,10 @@ None
REST API impact
---------------
The field ``maintenance_type`` will be added to ``Node`` API object, ironic
The field ``fault`` will be added to ``Node`` API object, ironic
API version will be bumped.
Node related API endpoints will be affected:
Node-related API endpoints will be affected:
* POST /v1/nodes
* GET /v1/nodes
@ -90,11 +109,11 @@ Node related API endpoints will be affected:
* GET /v1/nodes/{node_ident}
* PATCH /v1/nodes/{node_ident}
For clients with older microversion, the ``maintenance_type`` is hidden from
the response, this will be done in the ``hide_fields_in_newer_versions``.
For requests with older microversion, ``node.fault`` is hidden from
the response. This will be done in the ``hide_fields_in_newer_versions()``.
Field ``maintenance_type`` is read only from API, and can only be modified
internally. Any POST/PATCH request to Node with ``maintenance_type`` set will
Field ``fault`` is read-only from the API, and can only be modified
internally. Any POST/PATCH request to Node with ``fault`` set will
get 400 (Bad Request).
There is no other impact to API behaviors.
@ -111,7 +130,7 @@ None
"openstack baremetal" CLI
~~~~~~~~~~~~~~~~~~~~~~~~~
The CLI will be updated to support field ``maintenance_type``, guarded
The CLI will be updated to support field ``fault``, guarded
by ironic API microversion.
RPC API impact
@ -127,7 +146,9 @@ None
Nova driver impact
------------------
None
None, although we should update the ironic-virt driver code to
also look at this new node.fault field, in addition to the
``node.maintenance`` field.
Ramdisk impact
--------------
@ -185,13 +206,12 @@ Other contributors:
Work Items
----------
* Update db layer to include the ``maintenance_type`` field.
* Change places where nodes entering/leaving maintenance to set
``maintenance_type`` accordingly.
* Update db layer to include the ``fault`` field.
* Change places where nodes enter/leave maintenance, to set
``fault`` accordingly.
* Add ``[conductor]power_failure_recovery_interval`` option to ironic
configuration, add periodic task to handle power recovery.
* API change.
* Handle maintenance data migration on db upgrade.
Dependencies
============
@ -211,15 +231,23 @@ Upgrades and Backwards Compatibility
ironic API change is guarded by microversion.
dbsync and db api will be updated to facilitate populating the
``maintenance_type`` field to ``manual`` during data migration for nodes
previously been put under maintenance.
When upgrading, any nodes in maintenance will have ``node.fault`` set to None.
This is because there is no easy/guaranteed way to determine if a node had
been (previously) put into maintenance due to one of our internal faults.
Furthermore, even if we could determine whether it was due to a fault, we don't
know whether the operator wants to keep the node in maintenance for other
reasons besides the fault.
Should the operator want to take advantage of the power fault recovery
mechanism, they could take the nodes out of maintenance. If there are still
issues, ironic will do its thing -- detect the fault and try to recover.
Documentation Impact
====================
* Update api-reference.
* New option will generated by config generator.
* New option will be generated by config generator.
References
==========