Guard shared device/cluster filesystems

Certain filesystems are used in specialty computing environments where a
shared storage infrastructure or fabric exists. These filesystems allow
multiple hosts shared, concurrent read/write access to the underlying
block device by *not* locking the entire device for exclusive use.
Generally, ranges of the disk are reserved for each interacting node to
write to, and locking schemes are used to prevent collisions.

These filesystems are common for use cases where high availability is
required, or where the ability of individual computers to collaborate on
a given workload is critical, such as a group of hypervisors supporting
virtual machines, because shared storage allows for nearly seamless
transfer of a workload from one machine to another.

Similar technologies are also used for cluster quorum and durable
cluster state sharing; however, those uses are not specifically
considered in scope here.

Where things get difficult is that the entire device is not exclusively
locked with these storage fabrics, and in some cases locking is handled
by a Distributed Lock Manager on the network, or via special sector
interactions amongst the cluster members which understand and support
the filesystem.

As a result of this IO/interaction model, an Ironic-Python-Agent
performing cleaning can effectively destroy the cluster just by
attempting to clean storage which it perceives as attached locally.
This is not IPA's fault; often this case occurs when a Storage
Administrator forgot to update LUN masking or volume settings on
a SAN as it relates to an individual host in the overall
computing environment. The net result of one node cleaning the
shared volume may include restoration from snapshot or backup
storage, or may ultimately be permanent data loss, depending
on the environment and the usage of that environment.

Included in this patch:
- IBM GPFS - Can be used on a shared block device... apparently, according
             to IBM's documentation. The standard use of GPFS is more
             Ceph-like in design... however, GPFS is also a specially
             licensed commercial offering, so it is a red flag if this is
             encountered, and should be investigated by the environment's
             systems operator.
- Red Hat GFS2 - Is used with shared common block devices in clusters.
- VMware VMFS - Is used with shared SAN block devices, as well as
                local block devices. With shared block devices,
                ranges of the disk are locked instead of the whole
                disk, and the ranges are mapped to virtual machine
                disk interfaces.
                It is unknown, due to lack of information, if this
                will detect and prevent erasure of VMFS logical
                extent volumes.

Co-Authored-by: Jay Faulkner <jay@jvf.cc>
Change-Id: Ic8cade008577516e696893fdbdabf70999c06a5b
Story: 2009978
Task: 44985
Julia Kreger 2022-04-22 09:36:31 -07:00 committed by Jay Faulkner
parent 0bd39c41cf
commit beb7484858
8 changed files with 499 additions and 9 deletions

@ -103,3 +103,23 @@ Clean steps
Delete the RAID configuration. This step belongs to the ``raid`` interface
and must be used through the :ironic-doc:`ironic RAID feature
<admin/raid.html>`.
Cleaning safeguards
-------------------
The stock hardware manager contains a number of safeguards to prevent
unsafe conditions from occurring.
Shared Disk Cluster Filesystems
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Commonly used shared disk cluster filesystems, when detected, cause cleaning
processes in the stock hardware manager methods to abort prior to destroying
the contents on the disk.
These filesystems include IBM General Parallel File System (GPFS),
VMware Virtual Machine File System (VMFS), and Red Hat Global File System 2
(GFS2).
For information on troubleshooting, and disabling this check,
see :doc:`/admin/troubleshooting`.

@ -24,7 +24,7 @@ image boots. Kernel command line parameters are used to do this.
dynamic-login element example:
- Add ``sshkey="ssh-rsa BBA1..."`` to pxe_append_params setting in
- Add ``sshkey="ssh-rsa BBA1..."`` to kernel_append_params setting in
the ``ironic.conf`` file
- Restart the ironic-conductor with the command
``service ironic-conductor restart``
@ -76,7 +76,7 @@ are used to do this.
dynamic-login element example::
Generate an ENCRYPTED_PASSWORD with the openssl passwd -1 command
Add rootpwd="$ENCRYPTED_PASSWORD" value on the pxe_append_params setting in /etc/ironic/ironic.conf
Add rootpwd="$ENCRYPTED_PASSWORD" value on the kernel_append_params setting in /etc/ironic/ironic.conf
Restart the ironic-conductor with the command service ironic-conductor restart
Users can also be added to DIB built IPA images with the devuser element [1]_
@ -118,7 +118,7 @@ Set IPA to debug logging
Debug logging can be enabled in several different ways. The easiest way is to
add ``ipa-debug=1`` to the kernel command line. To do this:
- Append ``ipa-debug=1`` to the pxe_append_params setting in the
- Append ``ipa-debug=1`` to the kernel_append_params setting in the
``ironic.conf`` file
- Restart the ironic-conductor with the command
``service ironic-conductor restart``
@ -161,6 +161,96 @@ If the system uses systemd then systemctl can be used to restart the service::
sudo systemctl restart ironic-python-agent.service
Cleaning halted with ProtectedDeviceError
=========================================
The IPA service has halted cleaning as one of the block devices within or
attached to the bare metal node contains a class of filesystem which **MAY**
cause irreparable harm to a potentially running cluster if accidentally removed.
These filesystems *may* be used only for local storage and as a result be
safe to erase. However, a shared block device may also be in use, such as a
device supplied via a Storage Area Network utilizing protocols such as iSCSI
or FibreChannel, and the agent cannot reliably tell the difference: the
Host Bus Adapter (HBA) may not be an entirely "detectable" entity given the
hardware marketplace and aspects such as "SmartNICs" and Converged Network
Adapters with specific offload functions to support standards like
"NVMe over Fabric" (NVMe-oF).
By default, the agent will prevent these filesystems from being deleted and
will halt the cleaning process when detected. The cleaning process can be
re-triggered via Ironic's state machine once one of the documented settings
has been used to notify the agent that the protection is not required.
What filesystems are looked for
-------------------------------
+-------------------------------------------+
| IBM General Parallel Filesystem |
+-------------------------------------------+
| Red Hat Global Filesystem 2 |
+-------------------------------------------+
| VMware Virtual Machine FileSystem (VMFS) |
+-------------------------------------------+
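If you are unsure which device or signature triggered the safeguard, you can
run roughly the same query the agent performs against the suspect device
(``/dev/sda`` below is only an example path)::

    lsblk -Pbia -oFSTYPE,UUID,PTUUID,PARTTYPE,PARTUUID /dev/sda

A ``FSTYPE="gfs2"`` value, or a partition type/UUID associated with GPFS or
VMFS, is what causes cleaning to halt.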
I'm okay with deleting, how do I tell IPA to clean the disk(s)?
---------------------------------------------------------------
Four potential ways exist to signal to IPA. Please note, all of these options
require either access to the node in Ironic's API or the ability to modify
Ironic's configuration.
Via Ironic
~~~~~~~~~~
.. note:: This option requires that the version of Ironic be recent enough
to understand and explicitly provide this option to the Agent.
Inform Ironic to provide the option to the Agent::
baremetal node set --driver-info wipe_special_filesystems=True
Via a node's kernel_append_params setting
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This may be set on a per-node basis by utilizing the
``kernel_append_params`` override setting in the node's ``driver_info``.
Example::
baremetal node set --driver-info kernel_append_params="ipa-guard-special-filesystems=False"
Alternatively, if you wish to set this only once, you may use
the ``instance_info`` field, which is wiped upon teardown of the node.
Example::
baremetal node set --instance-info kernel_append_params="ipa-guard-special-filesystems=False"
Via Ironic's Boot time PXE parameters (Globally)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Globally, this setting may be passed by modifying the ``ironic.conf``
configuration file on your cluster and adding the
``ipa-guard-special-filesystems=False`` string to the
``[pxe]kernel_append_params`` parameter.
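For example, the resulting ``ironic.conf`` entry might look like the
following, where the other kernel parameters shown are purely illustrative
placeholders for whatever your deployment already sets::

    [pxe]
    kernel_append_params = nofb nomodeset vga=normal ipa-guard-special-filesystems=False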
.. warning::
If you're running a multi-conductor deployment, all of your ``ironic.conf``
configuration files will need to be updated to match.
Via Ramdisk configuration
~~~~~~~~~~~~~~~~~~~~~~~~~
This option requires modifying the ramdisk, and is the most complex, but may
be advisable if you have a mixed environment cluster where shared clustered
filesystems may be a concern on some machines, but not others.
.. warning::
This requires rebuilding your agent ramdisk, and modifying the embedded
configuration file for the ironic-python-agent. If you're confused at all
by this statement, this option is not for you.
Edit /etc/ironic_python_agent/ironic_python_agent.conf and set the parameter
``[DEFAULT]guard_special_filesystems`` to ``False``.
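A minimal sketch of the resulting agent configuration file, assuming no other
options need to change::

    [DEFAULT]
    guard_special_filesystems = False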
References
==========

@ -10,6 +10,13 @@ deploying your own hardware manager.
IPA ships with :doc:`GenericHardwareManager </admin/hardware_managers>`, which
implements basic cleaning and deployment methods compatible with most hardware.
.. warning::
Some functionality inherent in the stock hardware manager cleaning methods
may be useful in custom hardware managers, but should not be expected to
automatically work in custom managers as well. Examples of this are the
clustered filesystem protections and the cleaning method fallback logic.
Custom hardware manager maintainers should be mindful when overriding the
stock methods.
How are methods executed on HardwareManagers?
---------------------------------------------
Methods that modify hardware are dispatched to each hardware manager in
@ -149,6 +156,18 @@ function with extra parameters.
be set to whichever manager has a higher hardware support level and then
use the higher priority in the case of a tie.
In some cases, it may be necessary to create a customized cleaning step that
follows a particular pattern of behavior. Those doing such work may want to
leverage the filesystem safety checks, which are part of the stock hardware
managers.
.. code-block:: python
def custom_erase_devices(self, node, ports):
for dev in determine_my_devices_to_erase():
hardware.safety_check_block_device(node, dev.name)
my_special_cleaning(dev.name)
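The safety check raises ``errors.ProtectedDeviceError`` when a guarded
filesystem is detected. A custom step will normally want to let that
exception propagate so cleaning halts; the sketch below reuses the
hypothetical helpers from the example above and assumes the usual
``errors``/``hardware`` imports plus an oslo.log ``LOG`` object, simply
logging before re-raising.

.. code-block:: python

    def custom_erase_devices_with_logging(self, node, ports):
        for dev in determine_my_devices_to_erase():
            try:
                # Raises ProtectedDeviceError if a clustered filesystem
                # signature is found on the device.
                hardware.safety_check_block_device(node, dev.name)
            except errors.ProtectedDeviceError:
                LOG.error('Refusing to erase %s: possible shared cluster '
                          'filesystem detected.', dev.name)
                raise
            my_special_cleaning(dev.name)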
Custom HardwareManagers and Deploying
-------------------------------------

@ -315,6 +315,17 @@ cli_opts = [
min=0, max=99, # 100 is when IPA is booted
help='Priority of the inject_files deploy step (disabled '
'by default), an integer between 1 and .'),
cfg.BoolOpt('guard-special-filesystems',
default=APARAMS.get('ipa-guard-special-filesystems', True),
help='Guard "special" shared device filesystems from '
'cleaning by the stock hardware manager\'s cleaning '
'methods. If one of these filesystems is detected '
'during cleaning, the cleaning process will be aborted '
'and infrastructure operator intervention may be '
'required as this option is intended to prevent '
'cleaning from inadvertently destroying a running '
'cluster which may be visible over a storage fabric '
'such as FibreChannel.'),
]
CONF.register_cli_opts(cli_opts)

@ -348,3 +348,22 @@ class HeartbeatConnectionError(IronicAPIError):
def __init__(self, details):
super(HeartbeatConnectionError, self).__init__(details)
class ProtectedDeviceError(CleaningError):
"""Error raised when a cleaning is halted due to a protected device."""
message = 'Protected device located, cleaning aborted.'
def __init__(self, device, what):
details = ('Protected %(what)s located on device %(device)s. '
'This volume or contents may be a shared block device. '
'Please consult your storage administrator, and restart '
'cleaning after either detaching the volumes, or '
'instructing IPA to not protect contents. See Ironic '
'Python Agent documentation for more information.' %
{'what': what,
'device': device})
self.message = details
super(CleaningError, self).__init__(details)

@ -924,6 +924,10 @@ class HardwareManager(object, metaclass=abc.ABCMeta):
:param node: Ironic node object
:param ports: list of Ironic port objects
:raises: ProtectedDeviceError if a device has been identified which
may require manual intervention due to its contents and the
operational risk involved, as it could also be a sign
of an environmental misconfiguration.
:return: a dictionary in the form {device.name: erasure output}
"""
erase_results = {}
@ -937,6 +941,7 @@ class HardwareManager(object, metaclass=abc.ABCMeta):
thread_pool = ThreadPool(min(max_pool_size, len(block_devices)))
for block_device in block_devices:
params = {'node': node, 'block_device': block_device}
safety_check_block_device(node, block_device.name)
erase_results[block_device.name] = thread_pool.apply_async(
dispatch_to_managers, ('erase_block_device',), params)
thread_pool.close()
@ -1541,6 +1546,10 @@ class GenericHardwareManager(HardwareManager):
:param ports: list of Ironic port objects
:raises BlockDeviceEraseError: when there's an error erasing the
block device
:raises: ProtectedDeviceError if a device has been identified which
may require manual intervention due to its contents and the
operational risk involved, as it could also be a sign
of an environmental misconfiguration.
"""
block_devices = self.list_block_devices(include_partitions=True)
# NOTE(coreywright): Reverse sort by device name so a partition (eg
@ -1549,6 +1558,7 @@ class GenericHardwareManager(HardwareManager):
block_devices.sort(key=lambda dev: dev.name, reverse=True)
erase_errors = {}
for dev in self._list_erasable_devices():
safety_check_block_device(node, dev.name)
try:
disk_utils.destroy_disk_metadata(dev.name, node['uuid'])
except processutils.ProcessExecutionError as e:
@ -1572,6 +1582,10 @@ class GenericHardwareManager(HardwareManager):
:param ports: list of Ironic port objects
:raises BlockDeviceEraseError: when there's an error erasing the
block device
:raises: ProtectedDeviceError if a device has been identified which
may require manual intervention due to its contents and the
operational risk involved, as it could also be a sign
of an environmental misconfiguration.
"""
erase_errors = {}
info = node.get('driver_internal_info', {})
@ -1579,6 +1593,7 @@ class GenericHardwareManager(HardwareManager):
LOG.debug("No erasable devices have been found.")
return
for dev in self._list_erasable_devices():
safety_check_block_device(node, dev.name)
try:
if self._is_nvme(dev):
execute_nvme_erase = info.get(
@ -2957,3 +2972,109 @@ def get_multipath_status():
# as if we directly try and work with the global var, we will be racing
# tests endlessly.
return MULTIPATH_ENABLED
def safety_check_block_device(node, device):
"""Performs safety checking of a block device before destroying.
In order to guard against destruction of file systems such as
shared-disk file systems
(https://en.wikipedia.org/wiki/Clustered_file_system#SHARED-DISK)
or similar filesystems where multiple distinct computers may have
unlocked concurrent IO access to the entire block device or
SAN Logical Unit Number, we need to evaluate and block cleaning
from occurring on these filesystems *unless* we have been explicitly
configured to do so.
This is because cleaning is an intentionally destructive operation,
and once started against such a device, given the complexities of
shared disk clustered filesystems where concurrent access is a design
element, in all likelihood the entire cluster can be negatively
impacted, and an operator will be forced to recover from snapshots
and/or backups of the volume's contents.
:param node: A node, or cached node object.
:param device: String representing the path to the block
device to be checked.
:raises: ProtectedDeviceError when a device is identified with
one of these known clustered filesystems, and the overall
settings have not indicated for the agent to skip such
safety checks.
"""
# NOTE(TheJulia): While this seems super rare, I found out after this
# thread of discussion started that I have customers which have done
# this and wiped out SAN volumes and their contents unintentionally
# as a result of these filesystems not being guarded.
# For those not familiar with shared disk clustered filesystems, think
# of it as if you're impacting a Ceph cluster, except you're suddenly
# removing the underlying disks from the OSD, and the entire cluster
# goes down.
if not CONF.guard_special_filesystems:
return
di_info = node.get('driver_internal_info', {})
if not di_info.get('wipe_special_filesystems', True):
return
report, _e = il_utils.execute('lsblk', '-Pbia',
'-oFSTYPE,UUID,PTUUID,PARTTYPE,PARTUUID',
device)
lines = report.splitlines()
identified_fs_types = []
identified_ids = []
for line in lines:
device = {}
# Split into KEY=VAL pairs
vals = shlex.split(line)
if not vals:
continue
for key, val in (v.split('=', 1) for v in vals):
if key == 'FSTYPE':
identified_fs_types.append(val)
if key in ['UUID', 'PTUUID', 'PARTTYPE', 'PARTUUID']:
identified_ids.append(val)
# Ignore block types not specified
_check_for_special_partitions_filesystems(
device,
identified_ids,
identified_fs_types)
def _check_for_special_partitions_filesystems(device, ids, fs_types):
"""Compare supplied IDs, Types to known items, and raise if found.
:param device: The block device in use, specifically for logging.
:param ids: A list of IDs found to check.
:param fs_types: A list of FS types found to check.
:raises: ProtectedDeviceError should a partition label or metadata
be discovered which suggests a shared disk clustered filesystem
has been discovered.
"""
guarded_ids = {
# Apparently GPFS can use shared volumes....
'37AFFC90-EF7D-4E96-91C3-2D7AE055B174': 'IBM GPFS Partition',
# Shared volume parallel filesystem
'AA31E02A-400F-11DB-9590-000C2911D1B8': 'VMware VMFS Partition (GPT)',
'0xfb': 'VMware VMFS Partition (MBR)',
}
for key, value in guarded_ids.items():
for id_value in ids:
if key == id_value:
raise errors.ProtectedDeviceError(
device=device,
what=value)
guarded_fs_types = {
'gfs2': 'Red Hat Global File System 2',
}
for key, value in guarded_fs_types.items():
for fs in fs_types:
if key == fs:
raise errors.ProtectedDeviceError(
device=device,
what=value)

@ -1710,10 +1710,12 @@ class TestGenericHardwareManager(base.IronicAgentTest):
mocked_listdir.assert_has_calls(expected_calls)
mocked_mpath.assert_called_once_with()
@mock.patch.object(hardware, 'safety_check_block_device', autospec=True)
@mock.patch.object(hardware, 'ThreadPool', autospec=True)
@mock.patch.object(hardware, 'dispatch_to_managers', autospec=True)
def test_erase_devices_no_parallel_by_default(self, mocked_dispatch,
mock_threadpool):
mock_threadpool,
mock_safety_check):
# NOTE(TheJulia): This test was previously more elaborate and
# had a high failure rate on py37 and py38. So instead, lets just
@ -1732,10 +1734,43 @@ class TestGenericHardwareManager(base.IronicAgentTest):
calls = [mock.call(1)]
self.hardware.erase_devices({}, [])
mock_threadpool.assert_has_calls(calls)
mock_safety_check.assert_has_calls([
mock.call({}, '/dev/sdj'),
mock.call({}, '/dev/hdaa')
])
@mock.patch.object(hardware, 'safety_check_block_device', autospec=True)
@mock.patch.object(hardware, 'ThreadPool', autospec=True)
@mock.patch.object(hardware, 'dispatch_to_managers', autospec=True)
def test_erase_devices_no_parallel_by_default_protected_device(
self, mocked_dispatch,
mock_threadpool,
mock_safety_check):
mock_safety_check.side_effect = errors.ProtectedDeviceError(
device='foo',
what='bar')
self.hardware.list_block_devices = mock.Mock()
self.hardware.list_block_devices.return_value = [
hardware.BlockDevice('/dev/sdj', 'big', 1073741824, True),
hardware.BlockDevice('/dev/hdaa', 'small', 65535, False),
]
calls = [mock.call(1)]
self.assertRaises(errors.ProtectedDeviceError,
self.hardware.erase_devices, {}, [])
mock_threadpool.assert_has_calls(calls)
mock_safety_check.assert_has_calls([
mock.call({}, '/dev/sdj'),
])
mock_threadpool.apply_async.assert_not_called()
@mock.patch.object(hardware, 'safety_check_block_device', autospec=True)
@mock.patch('multiprocessing.pool.ThreadPool.apply_async', autospec=True)
@mock.patch.object(hardware, 'dispatch_to_managers', autospec=True)
def test_erase_devices_concurrency(self, mocked_dispatch, mocked_async):
def test_erase_devices_concurrency(self, mocked_dispatch, mocked_async,
mock_safety_check):
internal_info = self.node['driver_internal_info']
internal_info['disk_erasure_concurrency'] = 10
mocked_dispatch.return_value = 'erased device'
@ -1760,9 +1795,15 @@ class TestGenericHardwareManager(base.IronicAgentTest):
for dev in (blkdev1, blkdev2)]
mocked_async.assert_has_calls(calls)
self.assertEqual(expected, result)
mock_safety_check.assert_has_calls([
mock.call(self.node, '/dev/sdj'),
mock.call(self.node, '/dev/hdaa'),
])
@mock.patch.object(hardware, 'safety_check_block_device', autospec=True)
@mock.patch.object(hardware, 'ThreadPool', autospec=True)
def test_erase_devices_concurrency_pool_size(self, mocked_pool):
def test_erase_devices_concurrency_pool_size(self, mocked_pool,
mock_safety_check):
self.hardware.list_block_devices = mock.Mock()
self.hardware.list_block_devices.return_value = [
hardware.BlockDevice('/dev/sdj', 'big', 1073741824, True),
@ -1782,6 +1823,10 @@ class TestGenericHardwareManager(base.IronicAgentTest):
self.hardware.erase_devices(self.node, [])
mocked_pool.assert_called_with(1)
mock_safety_check.assert_has_calls([
mock.call(self.node, '/dev/sdj'),
mock.call(self.node, '/dev/hdaa'),
])
@mock.patch.object(hardware, 'dispatch_to_managers', autospec=True)
def test_erase_devices_without_disk(self, mocked_dispatch):
@ -2441,6 +2486,7 @@ class TestGenericHardwareManager(base.IronicAgentTest):
mock.call('/sys/fs/pstore/' + arg) for arg in pstore_entries
])
@mock.patch.object(hardware, 'safety_check_block_device', autospec=True)
@mock.patch.object(il_utils, 'execute', autospec=True)
@mock.patch.object(disk_utils, 'destroy_disk_metadata', autospec=True)
@mock.patch.object(hardware.GenericHardwareManager,
@ -2449,7 +2495,7 @@ class TestGenericHardwareManager(base.IronicAgentTest):
'_list_erasable_devices', autospec=True)
def test_erase_devices_express(
self, mock_list_erasable_devices, mock_nvme_erase,
mock_destroy_disk_metadata, mock_execute):
mock_destroy_disk_metadata, mock_execute, mock_safety_check):
block_devices = [
hardware.BlockDevice('/dev/sda', 'sata', 65535, False),
hardware.BlockDevice('/dev/md0', 'raid-device', 32767, False),
@ -2466,7 +2512,44 @@ class TestGenericHardwareManager(base.IronicAgentTest):
mock.call('/dev/md0', self.node['uuid'])],
mock_destroy_disk_metadata.call_args_list)
mock_list_erasable_devices.assert_called_with(self.hardware)
mock_safety_check.assert_has_calls([
mock.call(self.node, '/dev/sda'),
mock.call(self.node, '/dev/md0'),
mock.call(self.node, '/dev/nvme0n1'),
mock.call(self.node, '/dev/nvme1n1')
])
@mock.patch.object(hardware, 'safety_check_block_device', autospec=True)
@mock.patch.object(il_utils, 'execute', autospec=True)
@mock.patch.object(disk_utils, 'destroy_disk_metadata', autospec=True)
@mock.patch.object(hardware.GenericHardwareManager,
'_nvme_erase', autospec=True)
@mock.patch.object(hardware.GenericHardwareManager,
'_list_erasable_devices', autospec=True)
def test_erase_devices_express_stops_on_safety_failure(
self, mock_list_erasable_devices, mock_nvme_erase,
mock_destroy_disk_metadata, mock_execute, mock_safety_check):
mock_safety_check.side_effect = errors.ProtectedDeviceError(
device='foo',
what='bar')
block_devices = [
hardware.BlockDevice('/dev/sda', 'sata', 65535, False),
hardware.BlockDevice('/dev/md0', 'raid-device', 32767, False),
hardware.BlockDevice('/dev/nvme0n1', 'nvme', 32767, False),
hardware.BlockDevice('/dev/nvme1n1', 'nvme', 32767, False)
]
mock_list_erasable_devices.return_value = list(block_devices)
self.assertRaises(errors.ProtectedDeviceError,
self.hardware.erase_devices_express, self.node, [])
mock_nvme_erase.assert_not_called()
mock_destroy_disk_metadata.assert_not_called()
mock_list_erasable_devices.assert_called_with(self.hardware)
mock_safety_check.assert_has_calls([
mock.call(self.node, '/dev/sda')
])
@mock.patch.object(hardware, 'safety_check_block_device', autospec=True)
@mock.patch.object(il_utils, 'execute', autospec=True)
@mock.patch.object(hardware.GenericHardwareManager,
'_is_virtual_media_device', autospec=True)
@ -2475,7 +2558,7 @@ class TestGenericHardwareManager(base.IronicAgentTest):
@mock.patch.object(disk_utils, 'destroy_disk_metadata', autospec=True)
def test_erase_devices_metadata(
self, mock_metadata, mock_list_devs, mock__is_vmedia,
mock_execute):
mock_execute, mock_safety_check):
block_devices = [
hardware.BlockDevice('/dev/sr0', 'vmedia', 12345, True),
hardware.BlockDevice('/dev/sdb2', 'raid-member', 32767, False),
@ -2509,7 +2592,62 @@ class TestGenericHardwareManager(base.IronicAgentTest):
mock.call(self.hardware, block_devices[2]),
mock.call(self.hardware, block_devices[5])],
mock__is_vmedia.call_args_list)
mock_safety_check.assert_has_calls([
mock.call(self.node, '/dev/sda1'),
mock.call(self.node, '/dev/sda'),
# This is kind of redundant code/pattern behavior wise
# but you never know what someone has done...
mock.call(self.node, '/dev/md0')
])
@mock.patch.object(hardware, 'safety_check_block_device', autospec=True)
@mock.patch.object(il_utils, 'execute', autospec=True)
@mock.patch.object(hardware.GenericHardwareManager,
'_is_virtual_media_device', autospec=True)
@mock.patch.object(hardware.GenericHardwareManager,
'list_block_devices', autospec=True)
@mock.patch.object(disk_utils, 'destroy_disk_metadata', autospec=True)
def test_erase_devices_metadata_safety_check(
self, mock_metadata, mock_list_devs, mock__is_vmedia,
mock_execute, mock_safety_check):
block_devices = [
hardware.BlockDevice('/dev/sr0', 'vmedia', 12345, True),
hardware.BlockDevice('/dev/sdb2', 'raid-member', 32767, False),
hardware.BlockDevice('/dev/sda', 'small', 65535, False),
hardware.BlockDevice('/dev/sda1', '', 32767, False),
hardware.BlockDevice('/dev/sda2', 'raid-member', 32767, False),
hardware.BlockDevice('/dev/md0', 'raid-device', 32767, False)
]
# NOTE(coreywright): Don't return the list, but a copy of it, because
# we depend on its elements' order when referencing it later during
# verification, but the method under test sorts the list changing it.
mock_list_devs.return_value = list(block_devices)
mock__is_vmedia.side_effect = lambda _, dev: dev.name == '/dev/sr0'
mock_execute.side_effect = [
('sdb2 linux_raid_member host:1 f9978968', ''),
('sda2 linux_raid_member host:1 f9978969', ''),
('sda1', ''), ('sda', ''), ('md0', '')]
mock_safety_check.side_effect = [
None,
errors.ProtectedDeviceError(
device='foo',
what='bar')
]
self.assertRaises(errors.ProtectedDeviceError,
self.hardware.erase_devices_metadata,
self.node, [])
self.assertEqual([mock.call('/dev/sda1', self.node['uuid'])],
mock_metadata.call_args_list)
mock_list_devs.assert_called_with(self.hardware,
include_partitions=True)
mock_safety_check.assert_has_calls([
mock.call(self.node, '/dev/sda1'),
mock.call(self.node, '/dev/sda'),
])
@mock.patch.object(hardware, 'safety_check_block_device', autospec=True)
@mock.patch.object(hardware.GenericHardwareManager,
'_is_linux_raid_member', autospec=True)
@mock.patch.object(hardware.GenericHardwareManager,
@ -2519,7 +2657,7 @@ class TestGenericHardwareManager(base.IronicAgentTest):
@mock.patch.object(disk_utils, 'destroy_disk_metadata', autospec=True)
def test_erase_devices_metadata_error(
self, mock_metadata, mock_list_devs, mock__is_vmedia,
mock__is_raid_member):
mock__is_raid_member, mock_safety_check):
block_devices = [
hardware.BlockDevice('/dev/sda', 'small', 65535, False),
hardware.BlockDevice('/dev/sdb', 'big', 10737418240, True),
@ -2553,6 +2691,10 @@ class TestGenericHardwareManager(base.IronicAgentTest):
self.assertEqual([mock.call(self.hardware, block_devices[1]),
mock.call(self.hardware, block_devices[0])],
mock__is_vmedia.call_args_list)
mock_safety_check.assert_has_calls([
mock.call(self.node, '/dev/sdb'),
mock.call(self.node, '/dev/sda')
])
@mock.patch.object(il_utils, 'execute', autospec=True)
def test__is_linux_raid_member(self, mocked_execute):
@ -4991,3 +5133,51 @@ class TestAPIClientSaveAndUse(base.IronicAgentTest):
calls = [mock.call('list_hardware_info'),
mock.call('wait_for_disks')]
mock_dispatch.assert_has_calls(calls)
@mock.patch.object(il_utils, 'execute', autospec=True)
class TestProtectedDiskSafetyChecks(base.IronicAgentTest):
def test_special_filesystem_guard_not_enabled(self, mock_execute):
CONF.set_override('guard_special_filesystems', False)
hardware.safety_check_block_device({}, '/dev/foo')
mock_execute.assert_not_called()
def test_special_filesystem_guard_node_indicates_skip(self, mock_execute):
node = {
'driver_internal_info': {
'wipe_special_filesystems': False
}
}
mock_execute.return_value = ('', '')
hardware.safety_check_block_device(node, '/dev/foo')
mock_execute.assert_not_called()
def test_special_filesystem_guard_enabled_no_results(self, mock_execute):
mock_execute.return_value = ('', '')
hardware.safety_check_block_device({}, '/dev/foo')
def test_special_filesystem_guard_raises(self, mock_execute):
GFS2 = 'FSTYPE="gfs2"\n'
GPFS1 = 'UUID="37AFFC90-EF7D-4E96-91C3-2D7AE055B174"\n'
GPFS2 = 'PTUUID="37AFFC90-EF7D-4E96-91C3-2D7AE055B174"\n'
GPFS3 = 'PARTTYPE="37AFFC90-EF7D-4E96-91C3-2D7AE055B174"\n'
GPFS4 = 'PARTUUID="37AFFC90-EF7D-4E96-91C3-2D7AE055B174"\n'
VMFS1 = 'UUID="AA31E02A-400F-11DB-9590-000C2911D1B8"\n'
VMFS2 = 'UUID="AA31E02A-400F-11DB-9590-000C2911D1B8"\n'
VMFS3 = 'UUID="AA31E02A-400F-11DB-9590-000C2911D1B8"\n'
VMFS4 = 'UUID="AA31E02A-400F-11DB-9590-000C2911D1B8"\n'
VMFS5 = 'UUID="0xfb"\n'
VMFS6 = 'PTUUID="0xfb"\n'
VMFS7 = 'PARTTYPE="0xfb"\n'
VMFS8 = 'PARTUUID="0xfb"\n'
expected_failures = [GFS2, GPFS1, GPFS2, GPFS3, GPFS4, VMFS1, VMFS2,
VMFS3, VMFS4, VMFS5, VMFS6, VMFS7, VMFS8]
for failure in expected_failures:
mock_execute.reset_mock()
mock_execute.return_value = (failure, '')
self.assertRaises(errors.ProtectedDeviceError,
hardware.safety_check_block_device,
{}, '/dev/foo')
self.assertEqual(1, mock_execute.call_count)

@ -0,0 +1,20 @@
---
fixes:
- |
Previously, when the ``ironic-python-agent`` performed erasure of block
devices during cleaning, it would automatically attempt to erase the
contents of any "shared device" clustered filesystems which may be in use
by multiple distinct machines over a storage fabric. Now, in particular,
IBM GPFS, Red Hat Global File System 2, and VMware Virtual Machine File
System (VMFS) are identified and cleaning is halted. This is important
because, should an active cluster be using this disk, cleaning could
potentially cause the cluster to go down, forcing restoration from backups.
Ideally, infrastructure operators should check their environment's storage
configuration and un-map any clustered filesystems from being visible to
Ironic nodes, unless explicitly needed and expected. Please see the
Ironic-Python-Agent troubleshooting documentation for more information.
issues:
- |
Logic to guard VMFS filesystems from being destroyed *may* not recognize
VMFS extents. Operators with examples of partitioning for extent usage
are encouraged to contact the Ironic community.