ironic/doc/source/deploy/cleaning.rst
Ruby Loo f2d9886f99 Add manual cleaning to documentation
This updates the documentation to include manual cleaning.

Change-Id: I8f91214911e8916c329c20a140e1d0957b1cc137
Partial-Bug: #1526290
2016-02-22 15:46:46 -05:00

331 lines
12 KiB
ReStructuredText

.. _cleaning:
=============
Node cleaning
=============
Overview
========
Ironic provides two modes for node cleaning: ``automated`` and ``manual``.
``Automated cleaning`` is automatically performed before the first
workload has been assigned to a node and when hardware is recycled from
one workload to another.
``Manual cleaning`` must be invoked by the operator.
.. _automated_cleaning:
Automated cleaning
==================
When hardware is recycled from one workload to another, ironic performs
automated cleaning on the node to ensure it's ready for another workload. This
ensures the tenant will get a consistent bare metal node deployed every time.
Ironic implements automated cleaning by collecting a list of cleaning steps
to perform on a node from the Power, Deploy, Management, and RAID interfaces
of the driver assigned to the node. These steps are then ordered by priority
and executed on the node when the node is moved
to ``cleaning`` state, if automated cleaning is enabled.
With automated cleaning, nodes move to ``cleaning`` state when moving from
``active`` -> ``available`` state (when the hardware is recycled from one
workload to another). Nodes also traverse cleaning when going from
``manageable`` -> ``available`` state (before the first workload is
assigned to the nodes). For a full understanding of all state transitions
into cleaning, please see :ref:`states`.
Ironic added support for automated cleaning in the Kilo release.
.. _enabling-cleaning:
Enabling automated cleaning
---------------------------
To enable automated cleaning, ensure that your ironic.conf is set as follows.
(Prior to Mitaka, this option was named 'clean_nodes'.)::
[conductor]
automated_clean=true
This will enable the default set of cleaning steps, based on your hardware and
ironic drivers. If you're using an agent_* driver, this includes, by default,
erasing all of the previous tenant's data.
You may also need to configure a `Cleaning Network`_.
Cleaning steps
--------------
Cleaning steps used for automated cleaning are ordered from higher to lower
priority, where a larger integer is a higher priority. In case of a conflict
between priorities across drivers, the following resolution order is used:
Power, Management, Deploy, and RAID interfaces.
You can skip a cleaning step by setting the priority for that cleaning step
to zero or 'None'.
You can reorder the cleaning steps by modifying the integer priorities of the
cleaning steps.
See `How do I change the priority of a cleaning step?`_ for more information.
Manual cleaning
===============
``Manual cleaning`` is typically used to handle long running, manual, or
destructive tasks that an operator wishes to perform either before the first
workload has been assigned to a node or between workloads. When initiating a
manual clean, the operator specifies the cleaning steps to be performed.
Manual cleaning can only be performed when a node is in the ``manageable``
state. Once the manual cleaning is finished, the node will be put in the
``manageable`` state again.
Ironic added support for manual cleaning in the 4.4 (Mitaka series)
release.
Setup
-----
In order for manual cleaning to work, you may need to configure a
`Cleaning Network`_.
Starting manual cleaning via API
--------------------------------
Manual cleaning can only be performed when a node is in the ``manageable``
state. The REST API request to initiate it is available in API version 1.15 and
higher::
PUT /v1/nodes/<node_ident>/states/provision
(Additional information is available `here <http://docs.openstack.org/developer/ironic/webapi/v1.html#nodes>`_.)
This API will allow operators to put a node directly into ``cleaning``
provision state from ``manageable`` state via 'target': 'clean'.
The PUT will also require the argument 'clean_steps' to be specified. This
is an ordered list of cleaning steps. A cleaning step is represented by a
dictionary (JSON), in the form::
{
'interface': <interface>,
'step': <name of cleaning step>,
'args': {<arg1>: <value1>, ..., <argn>: <valuen>}
}
The 'interface' and 'step' keys are required for all steps. If a cleaning step
method takes keyword arguments, the 'args' key may be specified. It
is a dictionary of keyword variable arguments, with each keyword-argument entry
being <name>: <value>.
If any step is missing a required keyword argument, manual cleaning will not be
performed and the node will be put in ``clean failed`` provision state with an
appropriate error message.
If, during the cleaning process, a cleaning step determines that it has
incorrect keyword arguments, all earlier steps will be performed and then the
node will be put in ``clean failed`` provision state with an appropriate error
message.
An example of the request body for this API::
{
"target":"clean",
"clean_steps": [{
"interface": "raid",
"step": "create_configuration",
"args": {"create_nonroot_volumes": "False"}
},
{
"interface": "deploy",
"step": "erase_devices"
}]
}
In the above example, the driver's RAID interface would configure hardware
RAID without non-root volumes, and then all devices would be erased
(in that order).
Starting manual cleaning via ``ironic`` CLI
-------------------------------------------
Manual cleaning is supported in the ``ironic node-set-provision-state``
command, starting with python-ironicclient 1.2.
The target/verb is 'clean' and the argument 'clean-steps' must be specified.
Its value is one of:
- a JSON string
- path to a JSON file whose contents are passed to the API
- '-', to read from stdin. This allows piping in the clean steps.
Using '-' to signify stdin is common in Unix utilities.
Keep in mind that manual cleaning is only supported in API version 1.15 and
higher.
An example of doing this with a JSON string::
ironic --ironic-api-version 1.15 node-set-provision-state \
clean --clean-steps '{"clean_steps": [...]}'
Or with a file::
ironic --ironic-api-version 1.15 node-set-provision-state \
clean --clean-steps my-clean-steps.txt
Or with stdin::
cat my-clean-steps.txt | ironic --ironic-api-version 1.15 \
node-set-provision-state clean --clean-steps -
Cleaning Network
================
If you are using the Neutron DHCP provider (the default) you will also need to
ensure you have configured a cleaning network. This network will be used to
boot the ramdisk for in-band cleaning. You can use the same network as your
tenant network. For steps to set up the cleaning network, please see
:ref:`CleaningNetworkSetup`.
.. _InbandvsOutOfBandCleaning:
In-band vs out-of-band
======================
Ironic uses two main methods to perform actions on a node: in-band and
out-of-band. Ironic supports using both methods to clean a node.
In-band
-------
In-band steps are performed by ironic making API calls to a ramdisk running
on the node using a Deploy driver. Currently, only the ironic-python-agent
ramdisk used with an agent_* driver supports in-band cleaning. By default,
ironic-python-agent ships with a minimal cleaning configuration, only erasing
disks. However, with this ramdisk, you can add your own cleaning steps and/or
override default cleaning steps with a custom Hardware Manager.
There is currently no support for in-band cleaning using the ironic pxe
ramdisk.
Out-of-band
-----------
Out-of-band are actions performed by your management controller, such as IPMI,
iLO, or DRAC. Out-of-band steps will be performed by ironic using a Power or
Management driver. Which steps are performed depends on the driver and hardware.
For Out-of-Band cleaning operations supported by iLO drivers, refer to
:ref:`ilo_node_cleaning`.
FAQ
===
How are cleaning steps ordered?
-------------------------------
For automated cleaning, cleaning steps are ordered by integer priority, where
a larger integer is a higher priority. In case of a conflict between priorities
across drivers, the following resolution order is used: Power, Management,
Deploy, and RAID interfaces.
For manual cleaning, the cleaning steps should be specified in the desired
order.
How do I skip a cleaning step?
------------------------------
For automated cleaning, cleaning steps with a priority of 0 or None are skipped.
How do I change the priority of a cleaning step?
------------------------------------------------
For manual cleaning, specify the cleaning steps in the desired order.
For automated cleaning, it depends on whether the cleaning steps are
out-of-band or in-band.
Most out-of-band cleaning steps have an explicit configuration option for
priority.
Changing the priority of an in-band (ironic-python-agent) cleaning step
requires use of a custom HardwareManager. The only exception is
``erase_devices``, which can have its priority set in ironic.conf. For instance,
to disable erase_devices, you'd set the following configuration option::
[deploy]
erase_devices_priority=0
To enable/disable the in-band disk erase using ``agent_ilo`` driver, use the
following configuration option::
[ilo]
clean_priority_erase_devices=0
The generic hardware manager first tries to perform ATA disk erase by using
``hdparm`` utility. If ATA disk erase is not supported, it performs software
based disk erase using ``shred`` utility. By default, the number of iterations
performed by ``shred`` for software based disk erase is 1. To configure
the number of iterations, use the following configuration option::
[deploy]
erase_devices_iterations=1
What cleaning step is running?
------------------------------
To check what cleaning step the node is performing or attempted to perform and
failed, either query the node endpoint for the node or run ``ironic node-show
$node_ident`` and look in the `internal_driver_info` field. The `clean_steps`
field will contain a list of all remaining steps with their priorities, and the
first one listed is the step currently in progress or that the node failed
before going into ``clean failed`` state.
Should I disable automated cleaning?
------------------------------------
Automated cleaning is recommended for ironic deployments, however, there are
some tradeoffs to having it enabled. For instance, ironic cannot deploy a new
instance to a node that is currently cleaning, and cleaning can be a time
consuming process. To mitigate this, we suggest using disks with support for
cryptographic ATA Security Erase, as typically the erase_devices step in the
deploy driver takes the longest time to complete of all cleaning steps.
Why can't I power on/off a node while it's cleaning?
----------------------------------------------------
During cleaning, nodes may be performing actions that shouldn't be
interrupted, such as BIOS or Firmware updates. As a result, operators are
forbidden from changing power state via the ironic API while a node is
cleaning.
Troubleshooting
===============
If cleaning fails on a node, the node will be put into ``clean failed`` state
and placed in maintenance mode, to prevent ironic from taking actions on the
node.
Nodes in ``clean failed`` will not be powered off, as the node might be in a
state such that powering it off could damage the node or remove useful
information about the nature of the cleaning failure.
A ``clean failed`` node can be moved to ``manageable`` state, where it cannot
be scheduled by nova and you can safely attempt to fix the node. To move a node
from ``clean failed`` to ``manageable``:
``ironic node-set-provision-state manage``.
You can now take actions on the node, such as replacing a bad disk drive.
Strategies for determining why a cleaning step failed include checking the
ironic conductor logs, viewing logs on the still-running ironic-python-agent
(if an in-band step failed), or performing general hardware troubleshooting on
the node.
When the node is repaired, you can move the node back to ``available`` state,
to allow it to be scheduled by nova.
::
# First, move it out of maintenance mode
ironic node-set-maintenance $node_ident false
# Now, make the node available for scheduling by nova
ironic node-set-provision-state $node_ident provide
The node will begin automated cleaning from the start, and move to
``available`` state when complete.