diff --git a/doc/source/admin/hardware-burn-in.rst b/doc/source/admin/hardware-burn-in.rst
new file mode 100644
index 0000000000..0792d58ff4
--- /dev/null
+++ b/doc/source/admin/hardware-burn-in.rst
@@ -0,0 +1,166 @@
+.. _hardware-burn-in:
+
+================
+Hardware Burn-in
+================
+
+Overview
+========
+
+Workflows to onboard new hardware often include a stress-testing step to
+provoke early failures and to avoid that these load-triggered issues
+surface only once the nodes have moved to production. These ``burn-in``
+tests typically include CPU, memory, disk, and network. With the Xena
+release, Ironic supports such tests as part of the cleaning framework.
+
+The burn-in steps rely on standard tools such as
+`stress-ng <https://wiki.ubuntu.com/Kernel/Reference/stress-ng>`_ for CPU
+and memory, or `fio <https://fio.readthedocs.io/en/latest/>`_ for disk and
+network. The burn-in cleaning steps are part of the generic hardware
+manager in the Ironic Python Agent (IPA) and therefore the agent ramdisk
+does not need to be bundled with a specific
+:ironic-python-agent-doc:`IPA hardware manager
+<admin/hardware_managers.html>` to have them available.
+
+Each burn-in step accepts (or, in the case of network, requires) some
+basic configuration options, mostly to limit the duration of the test
+and to specify the amount of resources to be used. The options are set
+in a node's ``driver-info`` and prefixed with ``agent_burnin_``. The
+options available for the individual tests are outlined below.
+
+CPU burn-in
+===========
+
+The option names follow the schema ``agent_burnin_`` + stress-ng stressor
+(``cpu``) + stress-ng option:
+
+* ``agent_burnin_cpu_timeout`` (default: 24 hours) to limit the overall
+  runtime
+* ``agent_burnin_cpu_cpu`` (default: 0, meaning all CPUs) to pick the
+  number of CPUs to stress
+
+For instance, in order to limit the CPU burn-in to 10 minutes, run:
+
+.. code-block:: console
+
+   baremetal node set --driver-info agent_burnin_cpu_timeout=600 \
+       $NODE_NAME_OR_UUID
+
+Then launch the test with:
+
+.. code-block:: console
+
+   baremetal node clean \
+       --clean-steps '[{"step": "burnin_cpu", "interface": "deploy"}]' \
+       $NODE_NAME_OR_UUID
+
+Memory burn-in
+==============
+
+The option names follow the schema ``agent_burnin_`` + stress-ng stressor
+(``vm``) + stress-ng option:
+
+* ``agent_burnin_vm_timeout`` (default: 24 hours) to limit the overall
+  runtime
+* ``agent_burnin_vm_vm-bytes`` (default: 98%) to set the fraction of RAM
+  to stress
+
+For instance, in order to limit the memory burn-in to 1 hour and the
+amount of RAM to be used to 75%, run:
+
+.. code-block:: console
+
+   baremetal node set --driver-info agent_burnin_vm_timeout=3600 \
+       $NODE_NAME_OR_UUID
+   baremetal node set --driver-info agent_burnin_vm_vm-bytes=75 \
+       $NODE_NAME_OR_UUID
+
+Then launch the test with:
+
+.. code-block:: console
+
+   baremetal node clean \
+       --clean-steps '[{"step": "burnin_memory", "interface": "deploy"}]' \
+       $NODE_NAME_OR_UUID
+
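+For illustration, both of the above steps execute stress-ng inside the
+agent ramdisk. With the sample values used above, the underlying
+invocations correspond roughly to the following (illustrative only: the
+exact command lines are assembled by IPA and may vary between versions):
+
+.. code-block:: console
+
+   stress-ng --cpu 0 --timeout 600 --metrics-brief
+   stress-ng --vm 1 --vm-bytes 75% --timeout 3600 --metrics-brief
+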
+Disk burn-in
+============
+
+The option names follow the schema ``agent_burnin_`` + fio stressor
+(``fio_disk``) + fio option:
+
+* ``agent_burnin_fio_disk_runtime`` (default: 0, meaning no time limit)
+  to set the time limit
+* ``agent_burnin_fio_disk_loops`` (default: 4) to set the number of
+  iterations over the disks
+
+For instance, in order to limit the number of loops to 2, set:
+
+.. code-block:: console
+
+   baremetal node set --driver-info agent_burnin_fio_disk_loops=2 \
+       $NODE_NAME_OR_UUID
+
+Then launch the test with:
+
+.. code-block:: console
+
+   baremetal node clean \
+       --clean-steps '[{"step": "burnin_disk", "interface": "deploy"}]' \
+       $NODE_NAME_OR_UUID
+
+Network burn-in
+===============
+
+Burning in the network needs a little more configuration, since a pair
+of nodes is required to perform the test. To this end, the test requires
+``agent_burnin_fio_network_config`` to be set to a JSON with a ``role``
+field (values: ``reader`` or ``writer``) and a ``partner`` field (value:
+the hostname of the other node in the pair), like:
+
+.. code-block:: console
+
+   baremetal node set --driver-info \
+       agent_burnin_fio_network_config='{"role": "writer", "partner": "$HOST2"}' \
+       $NODE_NAME_OR_UUID1
+   baremetal node set --driver-info \
+       agent_burnin_fio_network_config='{"role": "reader", "partner": "$HOST1"}' \
+       $NODE_NAME_OR_UUID2
+
+In addition, and similar to the other tests, there is a runtime option
+to be set (on the writer only):
+
+.. code-block:: console
+
+   baremetal node set --driver-info agent_burnin_fio_network_runtime=600 \
+       $NODE_NAME_OR_UUID
+
+Then launch the test with:
+
+.. code-block:: console
+
+   baremetal node clean \
+       --clean-steps '[{"step": "burnin_network", "interface": "deploy"}]' \
+       $NODE_NAME_OR_UUID1
+   baremetal node clean \
+       --clean-steps '[{"step": "burnin_network", "interface": "deploy"}]' \
+       $NODE_NAME_OR_UUID2
+
+Each node will block and wait for its partner to show up. If the partner
+does not appear, the cleaning timeout will eventually abort the step.
+
+Additional Information
+======================
+
+All tests can be aborted at any moment with:
+
+.. code-block:: console
+
+   baremetal node abort $NODE_NAME_OR_UUID
+
+One can also launch multiple tests, which will then be run in sequence,
+e.g.:
+
+.. code-block:: console
+
+   baremetal node clean \
+       --clean-steps '[{"step": "burnin_cpu", "interface": "deploy"},
+                       {"step": "burnin_memory", "interface": "deploy"}]' \
+       $NODE_NAME_OR_UUID
+
+If desired, configuring ``fast-track`` may be helpful here, as it keeps
+the node up between consecutive calls of ``baremetal node clean``.
diff --git a/doc/source/admin/index.rst b/doc/source/admin/index.rst
index 86b1e55ff3..28f55d021c 100644
--- a/doc/source/admin/index.rst
+++ b/doc/source/admin/index.rst
@@ -31,6 +31,7 @@ the services.
    Fast-Track Deployment
    Booting a Ramdisk or an ISO
    Deploying with anaconda deploy interface
+   Hardware Burn-in
 
 Drivers, Hardware Types and Hardware Interfaces
 -----------------------------------------------
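
For reference, fast track (mentioned at the end of the new document) is
controlled by the ``[deploy]/fast_track`` option in the Ironic
configuration; a minimal sketch, with all other deployment-specific
settings omitted:

.. code-block:: ini

   [deploy]
   fast_track = true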