Merge "Add get_service_steps logic to the agent"

Zuul 2023-09-15 22:29:59 +00:00 committed by Gerrit Code Review
commit 73b76da5fe
10 changed files with 672 additions and 2 deletions


@ -6,7 +6,8 @@ GenericHardwareManager
======================
This is the default hardware manager for ironic-python-agent. It provides
support for :ref:`hardware-inventory` and the default deploy and clean steps.
support for :ref:`hardware-inventory` and the default deploy, clean,
and service steps.
Deploy steps
------------
@ -104,6 +105,57 @@ Clean steps
and must be used through the :ironic-doc:`ironic RAID feature
<admin/raid.html>`.
Service steps
-------------
Service steps can be invoked by the operator of a bare metal node to modify
the node, or to perform some intermediate action, outside the realm of normal
use of a deployed bare metal instance. The interaction is similar to
cleaning, and some cleaning and deployment steps *are* available for use as
service steps.
``deploy.burnin_cpu``
Stress-test the CPUs of a node via stress-ng for a configurable
amount of time.
``deploy.burnin_memory``
Stress-test the memory of a node via stress-ng for a configurable
amount of time.
``deploy.burnin_network``
Stress-test the network of a pair of nodes via fio for a configurable
amount of time.
``raid.create_configuration``
Create a RAID configuration. This step belongs to the ``raid`` interface
and must be used through the :ironic-doc:`ironic RAID feature
<admin/raid.html>`.
``raid.apply_configuration(node, ports, raid_config, delete_existing=True)``
Apply a software RAID configuration. It belongs to the ``raid`` interface
and must be used through the :ironic-doc:`ironic RAID feature
<admin/raid.html>`.
``raid.delete_configuration``
Delete the RAID configuration. This step belongs to the ``raid`` interface
and must be used through the :ironic-doc:`ironic RAID feature
<admin/raid.html>`.
``deploy.write_image(node, ports, image_info, configdrive=None)``
A step backing the ``write_image`` deploy step of the
:ironic-doc:`direct deploy interface
<admin/interfaces/deploy.html#direct-deploy>`.
Should not be used explicitly, but can be overridden to provide a custom
way of writing an image.
``deploy.inject_files(node, ports, files, verify_ca=True)``
A step to inject files into a system. This step is documented in more
detail earlier in this documentation.
.. NOTE::
The Ironic Developers chose to limit the items available for service steps
such that the risk of data destruction is generally minimized.
That being said, it could be reasonable to reconfigure RAID devices through
local hardware managers *or* to write the base OS image as part of a
service operation. As such, caution should be taken, and if additional data
erasure steps are needed you may want to consider moving a node through
cleaning to remove the workload. Otherwise, if you have a use case, please
feel free to reach out to the Ironic Developers so we can understand and
enable your use case.
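As a rough illustration of the list above, the following sketch builds an ordered list of service step requests and rejects anything that is not one of the documented generic steps. The helper ``build_service_steps`` and the ``DOCUMENTED_SERVICE_STEPS`` table are hypothetical names used only for this example; the ``interface`` and ``step`` keys mirror the step dictionaries used elsewhere in this change, and how the list is submitted to Ironic is outside the scope of the sketch.

```python
import json

# Step names copied from the documented list above, grouped by interface.
DOCUMENTED_SERVICE_STEPS = {
    "deploy": {
        "burnin_cpu",
        "burnin_memory",
        "burnin_network",
        "write_image",
        "inject_files",
    },
    "raid": {
        "create_configuration",
        "apply_configuration",
        "delete_configuration",
    },
}


def build_service_steps(requested):
    """Return step dicts in the order given (hypothetical helper).

    ``requested`` is an iterable of (interface, step) pairs. Steps that are
    not in the documented list raise ValueError.
    """
    steps = []
    for interface, step in requested:
        if step not in DOCUMENTED_SERVICE_STEPS.get(interface, set()):
            raise ValueError(
                f"{interface}.{step} is not a documented service step")
        steps.append({"interface": interface, "step": step})
    return steps


steps = build_service_steps([
    ("deploy", "burnin_cpu"),
    ("deploy", "burnin_memory"),
])
print(json.dumps(steps, indent=2))
```

A request for ``deploy.burnin_disk`` would be rejected by this helper, which matches the intent of keeping data-destructive steps out of the service step list.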
Cleaning safeguards
-------------------
@ -194,3 +246,10 @@ Each settings in the list is a dictionary with the following fields:
The per-function configuration of the first port of the NIC
``function1Config``
The per-function configuration of the second port of the NIC
Service steps
-------------
The Clean steps supported by the MellanoxDeviceHardwareManager are also
available as Service steps if an infrastructure operator wishes to apply
new firmware for a running machine.


@ -246,6 +246,52 @@ There are two kinds of deploy steps:
def write_a_file(self, node, ports, path, contents, mode=0o644):
pass # Mount the disk, write a file.
Custom HardwareManagers and Service operations
----------------------------------------------
Starting with the Bobcat release cycle, a hardware manager can define
*service steps* that may be run during a service operation by exposing a
``get_service_steps`` call.
Service steps are intended to be invoked by an operator to perform an ad-hoc
action upon a node. Steps are not executed automatically at present, although
that may change in the future. As a result, steps are exposed in the same way
as Clean steps and Deploy steps, except that the priority value should be 0,
because the order requested by the user is what is used.
.. code-block:: python
def get_service_steps(self, node, ports):
return [
{
# A function on the custom hardware manager
'step': 'write_a_file',
# Steps with priority 0 don't run by default.
'priority': 0,
# Should be the deploy interface, unless there is driver-side
# support for another interface (as it is for RAID).
'interface': 'deploy',
# Arguments that can be required or optional.
'argsinfo': {
'path': {
'description': 'Path to file',
'required': True,
},
'contents': {
'description': 'Content of the file',
'required': True,
},
'mode': {
'description': 'Mode of the file, defaults to 0644',
'required': False,
},
}
}
]
def write_a_file(self, node, ports, path, contents, mode=0o644):
pass # Mount the disk, write a file.
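To make the role of ``argsinfo`` more concrete, here is a minimal sketch of how the arguments supplied with a step request could be checked against it. The helper ``check_step_args`` is hypothetical and is not part of ironic-python-agent; the argument names follow the ``write_a_file(self, node, ports, path, contents, mode=0o644)`` signature shown above.

```python
# argsinfo for the example step, keyed by the keyword arguments that
# write_a_file() accepts.
ARGSINFO = {
    "path": {"description": "Path to file", "required": True},
    "contents": {"description": "Content of the file", "required": True},
    "mode": {
        "description": "Mode of the file, defaults to 0644",
        "required": False,
    },
}


def check_step_args(argsinfo, args):
    """Return a list of problems with ``args`` (hypothetical helper).

    An empty list means every required argument is present and no
    undeclared argument was supplied.
    """
    problems = []
    for name, info in argsinfo.items():
        if info.get("required") and name not in args:
            problems.append(f"missing required argument: {name}")
    for name in args:
        if name not in argsinfo:
            problems.append(f"unexpected argument: {name}")
    return problems


print(check_step_args(ARGSINFO, {"path": "/etc/motd"}))
```

Because the step arguments are passed to the step function as keyword arguments, the keys in ``argsinfo`` need to match the function's parameter names.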
Versioning
~~~~~~~~~~
Each hardware manager has a name and a version. This version is used during


@ -312,6 +312,15 @@ class DeploymentError(RESTError):
super(DeploymentError, self).__init__(details)
class ServicingError(RESTError):
"""Error raised when a service step fails."""
message = 'Service step failed'
def __init__(self, details=None):
super(ServicingError, self).__init__(details)
class IncompatibleNumaFormatError(RESTError):
"""Error raised when unexpected format data in NUMA node."""


@ -0,0 +1,103 @@
# Copyright 2015 Rackspace, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from ironic_lib import exception as il_exc
from oslo_log import log
from ironic_python_agent import errors
from ironic_python_agent.extensions import base
from ironic_python_agent import hardware
LOG = log.getLogger()
class ServiceExtension(base.BaseAgentExtension):
@base.sync_command('get_service_steps')
def get_service_steps(self, node, ports):
"""Get the list of service steps supported for the node and ports
:param node: A dict representation of a node
:param ports: A dict representation of ports attached to node
:returns: A list of service steps with keys step, priority, and
reboot_requested
"""
LOG.debug('Getting service steps, called with node: %(node)s, '
'ports: %(ports)s', {'node': node, 'ports': ports})
hardware.cache_node(node)
# Results should be a dict, not a list
candidate_steps = hardware.dispatch_to_all_managers(
'get_service_steps', node, ports)
LOG.debug('Service steps before deduplication: %s', candidate_steps)
service_steps = hardware.deduplicate_steps(candidate_steps)
LOG.debug('Returning service steps: %s', service_steps)
return {
'service_steps': service_steps,
'hardware_manager_version': hardware.get_current_versions(),
}
@base.async_command('execute_service_step')
def execute_service_step(self, step, node, ports, service_version=None,
**kwargs):
"""Execute a service step.
:param step: A step with 'step', 'priority' and 'interface' keys
:param node: A dict representation of a node
:param ports: A dict representation of ports attached to node
:param service_version: The service version as returned by
hardware.get_current_versions() at the beginning
of the service operation.
:returns: a CommandResult object with command_result set to whatever
the step returns.
"""
# Ensure the agent is still the same version, or raise an exception
LOG.debug('Executing service step %s', step)
hardware.cache_node(node)
hardware.check_versions(service_version)
if 'step' not in step:
msg = 'Malformed service_step, no "step" key: %s' % step
LOG.error(msg)
raise ValueError(msg)
kwargs.update(step.get('args') or {})
try:
result = hardware.dispatch_to_managers(step['step'], node, ports,
**kwargs)
except (errors.RESTError, il_exc.IronicException):
LOG.exception('Error performing service step %s', step['step'])
raise
except Exception as e:
msg = ('Unexpected exception performing service step %(step)s. '
'%(cls)s: %(err)s' % {'step': step['step'],
'cls': e.__class__.__name__,
'err': e})
LOG.exception(msg)
raise errors.ServicingError(msg)
LOG.info('Service step completed: %(step)s, result: %(result)s',
{'step': step, 'result': result})
# Cast result tuples (like output of utils.execute) as lists, or
# API throws errors
if isinstance(result, tuple):
result = list(result)
# Return the step that was executed so we can dispatch
# to the appropriate Ironic interface
return {
'service_result': result,
'service_step': step
}
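The control flow of ``execute_service_step`` above can be sketched in isolation as follows. This is a simplified stand-in, not the agent code: ``ServicingError`` is redefined locally, ``run_service_step`` and the ``dispatch`` callable are hypothetical names, and the node caching and version check are omitted.

```python
class ServicingError(Exception):
    """Local stand-in for errors.ServicingError."""


def run_service_step(step, node, ports, dispatch, **kwargs):
    """Mimic the argument merging and result handling shown above."""
    if 'step' not in step:
        raise ValueError('Malformed service_step, no "step" key: %s' % step)

    # Arguments carried inside the step override keyword arguments
    # supplied by the caller.
    kwargs.update(step.get('args') or {})
    try:
        result = dispatch(step['step'], node, ports, **kwargs)
    except ServicingError:
        # Known error types are re-raised unchanged.
        raise
    except Exception as e:
        # Anything unexpected is wrapped so the caller sees one error type.
        raise ServicingError('%s: %s' % (e.__class__.__name__, e))

    # Tuples are converted to lists before being returned.
    if isinstance(result, tuple):
        result = list(result)
    return {'service_result': result, 'service_step': step}


def fake_dispatch(name, node, ports, **kwargs):
    # Hypothetical dispatcher that echoes what it received.
    return (name, kwargs)


outcome = run_service_step(
    {'step': 'burnin_cpu', 'args': {'foo': 'bar'}}, {}, [], fake_dispatch)
print(outcome['service_result'])
```

The returned dictionary carries the executed step alongside the result, which is what lets the caller associate the outcome with the step it requested.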


@ -1145,6 +1145,60 @@ class HardwareManager(object, metaclass=abc.ABCMeta):
"""
return []
def get_service_steps(self, node, ports):
"""Get a list of service steps.
Returns a list of steps. Each step is represented by a dict::
{
'interface': the name of the driver interface that should execute
the step.
'step': the HardwareManager function to call.
'priority': the order steps will be run in when executed
automatically, similar to automated cleaning or deployment.
In service steps, the order comes from the user request,
but the field is kept for consistency should we
further extend the capability at some point in the
future.
'reboot_requested': Whether the agent should request that Ironic
reboot the node via the power driver after the
operation completes.
'abortable': Boolean value. Whether the service step can be
stopped by the operator or not. Some steps may
cause non-reversible damage to a machine if interrupted
(e.g. a firmware update); for such steps this parameter
should be set to False. If no value is set for this
parameter, Ironic will consider it False (non-abortable).
}
If multiple hardware managers return the same step name, the following
logic will be used to determine which manager's step "wins":
* Keep the step that belongs to HardwareManager with highest
HardwareSupport (larger int) value.
* If equal support level, keep the step with the higher defined
priority (larger int).
* If equal support level and priority, keep the step associated
with the HardwareManager whose name comes earlier in the
alphabet.
The steps will be called using `hardware.dispatch_to_managers` and
handled by the best suited hardware manager. If you need a step to be
executed by only your hardware manager, ensure it has a unique step
name.
`node` and `ports` can be used by other hardware managers to further
determine if a step is supported for the node.
:param node: Ironic node object
:param ports: list of Ironic port objects
:return: a list of service steps, where each step is described as a
dict as defined above
"""
return []
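The tie-breaking rules described in the docstring above can be sketched as a small standalone function. This is an illustration of the three rules only, not the implementation of ``hardware.deduplicate_steps``; the names ``deduplicate`` and ``_beats`` are hypothetical, and the sample data is a trimmed version of the values used in the unit tests for this change.

```python
def _beats(candidate, current, support):
    """Return True if candidate (manager, step) should replace current."""
    cand_mgr, cand_step = candidate
    cur_mgr, cur_step = current
    cand_key = (support[cand_mgr], cand_step['priority'])
    cur_key = (support[cur_mgr], cur_step['priority'])
    if cand_key != cur_key:
        # Higher hardware support wins; priority breaks support ties.
        return cand_key > cur_key
    # Full tie: the manager whose name sorts earlier wins.
    return cand_mgr < cur_mgr


def deduplicate(steps_by_manager, support):
    """Keep one winning manager per step name (illustrative only)."""
    winners = {}
    for manager, steps in steps_by_manager.items():
        for step in steps:
            candidate = (manager, step)
            current = winners.get(step['step'])
            if current is None or _beats(candidate, current, support):
                winners[step['step']] = candidate

    result = {}
    for manager, step in winners.values():
        result.setdefault(manager, []).append(step)
    return result


manager_steps = {
    'SpecificHardwareManager': [
        {'step': 'erase_devices', 'priority': 10},
        {'step': 'upgrade_bios', 'priority': 20},
        {'step': 'upgrade_firmware', 'priority': 60},
    ],
    'FirmwareHardwareManager': [
        {'step': 'upgrade_firmware', 'priority': 10},
        {'step': 'erase_devices', 'priority': 40},
    ],
    'DiskHardwareManager': [
        {'step': 'erase_devices', 'priority': 50},
    ],
}
hardware_support = {
    'SpecificHardwareManager': 3,
    'FirmwareHardwareManager': 4,
    'DiskHardwareManager': 4,
}
deduplicated = deduplicate(manager_steps, hardware_support)
print(deduplicated)
```

Note that ``upgrade_firmware`` goes to the manager with the higher support level even though its step priority is lower, since support is compared before priority.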
def get_version(self):
"""Get a name and version for this hardware manager.
@ -1186,7 +1240,8 @@ class HardwareManager(object, metaclass=abc.ABCMeta):
class GenericHardwareManager(HardwareManager):
HARDWARE_MANAGER_NAME = 'generic_hardware_manager'
# 1.1 - Added new clean step called erase_devices_metadata
HARDWARE_MANAGER_VERSION = '1.1'
# 1.2 - Added new get_service_steps method
HARDWARE_MANAGER_VERSION = '1.2'
def __init__(self):
self.lldp_data = {}
@ -2399,6 +2454,77 @@ class GenericHardwareManager(HardwareManager):
},
]
# TODO(TheJulia): There has to be a better way, we should
# make this less copy paste. That being said, I can also see
# unique priorities being needed.
def get_service_steps(self, node, ports):
service_steps = [
{
'step': 'delete_configuration',
'priority': 0,
'interface': 'raid',
'reboot_requested': False,
'abortable': True
},
{
'step': 'apply_configuration',
'priority': 0,
'interface': 'raid',
'reboot_requested': False,
'argsinfo': RAID_APPLY_CONFIGURATION_ARGSINFO,
},
{
'step': 'create_configuration',
'priority': 0,
'interface': 'raid',
'reboot_requested': False,
'abortable': True
},
{
'step': 'burnin_cpu',
'priority': 0,
'interface': 'deploy',
'reboot_requested': False,
'abortable': True
},
# NOTE(TheJulia): Burnin disk is explicitly not carried in this
# list because it would be destructive to data on a disk.
# If someone needs to do that, the machine should be
# unprovisioned.
{
'step': 'burnin_memory',
'priority': 0,
'interface': 'deploy',
'reboot_requested': False,
'abortable': True
},
{
'step': 'burnin_network',
'priority': 0,
'interface': 'deploy',
'reboot_requested': False,
'abortable': True
},
{
'step': 'write_image',
# NOTE(dtantsur): this step has to be proxied via an
# out-of-band step with the same name, hence the priority here
# doesn't really matter.
'priority': 0,
'interface': 'deploy',
'reboot_requested': False,
},
{
'step': 'inject_files',
'priority': CONF.inject_files_priority,
'interface': 'deploy',
'reboot_requested': False,
'argsinfo': inject_files.ARGSINFO,
},
]
# TODO(TheJulia): Consider erase_devices and friends...
return service_steps
def apply_configuration(self, node, ports, raid_config,
delete_existing=True):
"""Apply RAID configuration.


@ -158,6 +158,14 @@ class MellanoxDeviceHardwareManager(hardware.HardwareManager):
}
]
def get_service_steps(self, node, ports):
"""Alias wrapper for method get_clean_steps."""
# NOTE(TheJulia): These steps can also be run during service, so the
# clean steps are reused as-is.
return self.get_clean_steps(node, ports)
# TODO(TheJulia): Should there be a get_deploy_steps handler here? Since
# flashing firmware on deploy is a valid case.
def update_nvidia_nic_firmware_image(self, node, ports, images):
nvidia_fw_update.update_nvidia_nic_firmware_image(images)


@ -0,0 +1,300 @@
# Copyright 2015 Rackspace, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from unittest import mock
from ironic_python_agent import errors
from ironic_python_agent.extensions import service
from ironic_python_agent.tests.unit import base
@mock.patch('ironic_python_agent.hardware.cache_node', autospec=True)
class TestServiceExtension(base.IronicAgentTest):
def setUp(self):
super(TestServiceExtension, self).setUp()
self.agent_extension = service.ServiceExtension()
self.node = {'uuid': 'dda135fb-732d-4742-8e72-df8f3199d244'}
self.ports = []
self.step = {
'GenericHardwareManager':
[{'step': 'erase_devices',
'priority': 10,
'interface': 'deploy'}]
}
self.version = {'generic': '1', 'specific': '1'}
@mock.patch('ironic_python_agent.hardware.get_current_versions',
autospec=True)
@mock.patch('ironic_python_agent.hardware.dispatch_to_all_managers',
autospec=True)
def test_get_service_steps(self, mock_dispatch, mock_version,
mock_cache_node):
mock_version.return_value = self.version
manager_steps = {
'SpecificHardwareManager': [
{
'step': 'erase_devices',
'priority': 10,
'interface': 'deploy',
'reboot_requested': False
},
{
'step': 'upgrade_bios',
'priority': 20,
'interface': 'deploy',
'reboot_requested': True
},
{
'step': 'upgrade_firmware',
'priority': 60,
'interface': 'deploy',
'reboot_requested': False
},
],
'FirmwareHardwareManager': [
{
'step': 'upgrade_firmware',
'priority': 10,
'interface': 'deploy',
'reboot_requested': False
},
{
'step': 'erase_devices',
'priority': 40,
'interface': 'deploy',
'reboot_requested': False
},
],
'DiskHardwareManager': [
{
'step': 'erase_devices',
'priority': 50,
'interface': 'deploy',
'reboot_requested': False
},
]
}
expected_steps = {
'SpecificHardwareManager': [
# Only manager upgrading BIOS
{
'step': 'upgrade_bios',
'priority': 20,
'interface': 'deploy',
'reboot_requested': True
}
],
'FirmwareHardwareManager': [
# Higher support than specific, even though lower priority
{
'step': 'upgrade_firmware',
'priority': 10,
'interface': 'deploy',
'reboot_requested': False
},
],
'DiskHardwareManager': [
# Higher support than specific, higher priority than firmware
{
'step': 'erase_devices',
'priority': 50,
'interface': 'deploy',
'reboot_requested': False
},
]
}
hardware_support = {
'SpecificHardwareManager': 3,
'FirmwareHardwareManager': 4,
'DiskHardwareManager': 4
}
mock_dispatch.side_effect = [manager_steps, hardware_support]
expected_return = {
'hardware_manager_version': self.version,
'service_steps': expected_steps
}
async_results = self.agent_extension.get_service_steps(
node=self.node,
ports=self.ports)
# Ordering of the service steps doesn't matter here; Ironic executes
# service steps in the order submitted by the user rather than by
# 'priority'.
self.assertEqual(expected_return,
async_results.join().command_result)
mock_cache_node.assert_called_once_with(self.node)
@mock.patch('ironic_python_agent.hardware.dispatch_to_managers',
autospec=True)
@mock.patch('ironic_python_agent.hardware.check_versions',
autospec=True)
def test_execute_service_step(self, mock_version, mock_dispatch,
mock_cache_node):
result = 'cleaned'
mock_dispatch.return_value = result
expected_result = {
'service_step': self.step['GenericHardwareManager'][0],
'service_result': result
}
async_result = self.agent_extension.execute_service_step(
step=self.step['GenericHardwareManager'][0],
node=self.node, ports=self.ports,
service_version=self.version)
async_result.join()
mock_version.assert_called_once_with(self.version)
mock_dispatch.assert_called_once_with(
self.step['GenericHardwareManager'][0]['step'],
self.node, self.ports)
self.assertEqual(expected_result, async_result.command_result)
mock_cache_node.assert_called_once_with(self.node)
@mock.patch('ironic_python_agent.hardware.dispatch_to_managers',
autospec=True)
@mock.patch('ironic_python_agent.hardware.check_versions',
autospec=True)
def test_execute_service_step_tuple_result(self, mock_version,
mock_dispatch, mock_cache_node):
result = ('stdout', 'stderr')
mock_dispatch.return_value = result
expected_result = {
'service_step': self.step['GenericHardwareManager'][0],
'service_result': ['stdout', 'stderr']
}
async_result = self.agent_extension.execute_service_step(
step=self.step['GenericHardwareManager'][0],
node=self.node, ports=self.ports,
service_version=self.version)
async_result.join()
mock_version.assert_called_once_with(self.version)
mock_dispatch.assert_called_once_with(
self.step['GenericHardwareManager'][0]['step'],
self.node, self.ports)
self.assertEqual(expected_result, async_result.command_result)
mock_cache_node.assert_called_once_with(self.node)
@mock.patch('ironic_python_agent.hardware.dispatch_to_managers',
autospec=True)
@mock.patch('ironic_python_agent.hardware.check_versions',
autospec=True)
def test_execute_service_step_with_args(self, mock_version, mock_dispatch,
mock_cache_node):
result = 'cleaned'
mock_dispatch.return_value = result
step = self.step['GenericHardwareManager'][0]
step['args'] = {'foo': 'bar'}
expected_result = {
'service_step': step,
'service_result': result
}
async_result = self.agent_extension.execute_service_step(
step=self.step['GenericHardwareManager'][0],
node=self.node, ports=self.ports,
service_version=self.version)
async_result.join()
mock_version.assert_called_once_with(self.version)
mock_dispatch.assert_called_once_with(
self.step['GenericHardwareManager'][0]['step'],
self.node, self.ports, foo='bar')
self.assertEqual(expected_result, async_result.command_result)
mock_cache_node.assert_called_once_with(self.node)
@mock.patch('ironic_python_agent.hardware.check_versions',
autospec=True)
def test_execute_service_step_no_step(self, mock_version, mock_cache_node):
async_result = self.agent_extension.execute_service_step(
step={}, node=self.node, ports=self.ports,
service_version=self.version)
async_result.join()
self.assertEqual('FAILED', async_result.command_status)
mock_version.assert_called_once_with(self.version)
mock_cache_node.assert_called_once_with(self.node)
@mock.patch('ironic_python_agent.hardware.dispatch_to_managers',
autospec=True)
@mock.patch('ironic_python_agent.hardware.check_versions',
autospec=True)
def test_execute_service_step_fail(self, mock_version, mock_dispatch,
mock_cache_node):
err = errors.BlockDeviceError("I'm a teapot")
mock_dispatch.side_effect = err
async_result = self.agent_extension.execute_service_step(
step=self.step['GenericHardwareManager'][0], node=self.node,
ports=self.ports, service_version=self.version)
async_result.join()
self.assertEqual('FAILED', async_result.command_status)
self.assertEqual(err, async_result.command_error)
mock_version.assert_called_once_with(self.version)
mock_dispatch.assert_called_once_with(
self.step['GenericHardwareManager'][0]['step'],
self.node, self.ports)
mock_cache_node.assert_called_once_with(self.node)
@mock.patch('ironic_python_agent.hardware.dispatch_to_managers',
autospec=True)
@mock.patch('ironic_python_agent.hardware.check_versions',
autospec=True)
def test_execute_service_step_exception(self, mock_version, mock_dispatch,
mock_cache_node):
mock_dispatch.side_effect = RuntimeError('boom')
async_result = self.agent_extension.execute_service_step(
step=self.step['GenericHardwareManager'][0], node=self.node,
ports=self.ports, service_version=self.version)
async_result.join()
self.assertEqual('FAILED', async_result.command_status)
self.assertIn('RuntimeError: boom', str(async_result.command_error))
mock_version.assert_called_once_with(self.version)
mock_dispatch.assert_called_once_with(
self.step['GenericHardwareManager'][0]['step'],
self.node, self.ports)
mock_cache_node.assert_called_once_with(self.node)
@mock.patch('ironic_python_agent.hardware.dispatch_to_managers',
autospec=True)
@mock.patch('ironic_python_agent.hardware.check_versions',
autospec=True)
def test_execute_service_step_version_mismatch(self, mock_version,
mock_dispatch,
mock_cache_node):
mock_version.side_effect = errors.VersionMismatch(
{'GenericHardwareManager': 1}, {'GenericHardwareManager': 2})
async_result = self.agent_extension.execute_service_step(
step=self.step['GenericHardwareManager'][0], node=self.node,
ports=self.ports, service_version=self.version)
async_result.join()
# NOTE(TheJulia): This remains CLEAN_VERSION_MISMATCH for backwards
# compatibility with base.py and API consumers.
self.assertEqual('CLEAN_VERSION_MISMATCH',
async_result.command_status)
mock_version.assert_called_once_with(self.version)


@ -211,6 +211,10 @@ class TestGenericHardwareManager(base.IronicAgentTest):
for step in self.hardware.get_deploy_steps(self.node, []):
getattr(self.hardware, step['step'])
def test_service_steps_exist(self):
for step in self.hardware.get_service_steps(self.node, []):
getattr(self.hardware, step['step'])
@mock.patch('binascii.hexlify', autospec=True)
@mock.patch('ironic_python_agent.netutils.get_lldp_info', autospec=True)
def test_collect_lldp_data(self, mock_lldp_info, mock_hexlify):


@ -0,0 +1,14 @@
---
features:
- |
Adds a new ``service`` extension which facilitates command handling for
Ironic to retrieve a list of service steps.
- Adds a new ``get_service_steps`` method to the base ``HardwareManager``
class, which works the same way as ``get_clean_steps`` and
``get_deploy_steps``. Hardware managers can override this method to
signal which steps are permitted.
- Extends reasonable deploy/clean steps to also be service steps which
are embedded in the Ironic agent. For example, CPU, Network, and Memory
burnin steps are available as service steps, but not the disk burnin
step as that would likely result in the existing disk contents being
damaged.


@ -43,6 +43,7 @@ ironic_python_agent.extensions =
log = ironic_python_agent.extensions.log:LogExtension
rescue = ironic_python_agent.extensions.rescue:RescueExtension
poll = ironic_python_agent.extensions.poll:PollExtension
service = ironic_python_agent.extensions.service:ServiceExtension
ironic_python_agent.hardware_managers =
generic = ironic_python_agent.hardware:GenericHardwareManager