diff --git a/doc/source/admin/hardware_managers.rst b/doc/source/admin/hardware_managers.rst
index 290fcf0f2..e51093f0b 100644
--- a/doc/source/admin/hardware_managers.rst
+++ b/doc/source/admin/hardware_managers.rst
@@ -103,3 +103,23 @@ Clean steps
 Delete the RAID configuration. This step belongs to the ``raid`` interface and
 must be used through the :ironic-doc:`ironic RAID feature `.
+
+Cleaning safeguards
+-------------------
+
+The stock hardware manager contains a number of safeguards to prevent
+unsafe conditions from occurring.
+
+Shared Disk Cluster Filesystems
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a commonly used shared disk cluster filesystem is detected, the cleaning
+processes in the stock hardware manager methods abort prior to destroying the
+contents of the disk.
+
+These filesystems include IBM General Parallel File System (GPFS),
+VMware Virtual Machine File System (VMFS), and Red Hat Global File System
+(GFS2).
+
+For information on troubleshooting and disabling this check,
+see :doc:`/admin/troubleshooting`.
diff --git a/doc/source/admin/troubleshooting.rst b/doc/source/admin/troubleshooting.rst
index 6e1a81c89..9e52f7d79 100644
--- a/doc/source/admin/troubleshooting.rst
+++ b/doc/source/admin/troubleshooting.rst
@@ -24,7 +24,7 @@ image boots. Kernel command line parameters are used to do this.
 
 dynamic-login element example:
 
-- Add ``sshkey="ssh-rsa BBA1..."`` to pxe_append_params setting in
+- Add ``sshkey="ssh-rsa BBA1..."`` to kernel_append_params setting in
   the ``ironic.conf`` file
 - Restart the ironic-conductor with the command ``service ironic-conductor
   restart``
@@ -76,7 +76,7 @@ are used to do this. dynamic-login element example::
 
   Generate a ENCRYPTED_PASSWORD with the openssl passwd -1 command
-  Add rootpwd="$ENCRYPTED_PASSWORD" value on the pxe_append_params setting in /etc/ironic/ironic.conf
+  Add rootpwd="$ENCRYPTED_PASSWORD" value on the kernel_append_params setting in /etc/ironic/ironic.conf
   Restart the ironic-conductor with the command service ironic-conductor restart
 
 Users can also be added to DIB built IPA images with the devuser element [1]_
@@ -118,7 +118,7 @@ Set IPA to debug logging
 Debug logging can be enabled a several different ways. The easiest way is to
 add ``ipa-debug=1`` to the kernel command line. To do this:
 
-- Append ``ipa-debug=1`` to the pxe_append_params setting in the
+- Append ``ipa-debug=1`` to the kernel_append_params setting in the
   ``ironic.conf`` file
 - Restart the ironic-conductor with the command ``service ironic-conductor
   restart``
@@ -161,6 +161,96 @@ If the system uses systemd then systemctl can be used to restart the service::
 
     sudo systemctl restart ironic-python-agent.service
 
+Cleaning halted with ProtectedDeviceError
+=========================================
+
+The IPA service halts cleaning when one of the block devices within or
+attached to the bare metal node contains a class of filesystem which **MAY**
+cause irreparable harm to a potentially running cluster if accidentally
+removed.
+
+These filesystems *may* be used only for local storage and, as a result, may
+be safe to erase. However, a shared block device may also be in use, such as
+a device supplied via a Storage Area Network utilizing protocols such as
+iSCSI or FibreChannel. Ultimately the Host Bus Adapter (HBA) may not be an
+entirely "detectable" entity given the hardware marketplace and aspects such
+as "SmartNICs" and Converged Network Adapters with specific offload functions
+to support standards like "NVMe over Fabric" (NVMe-oF), so the agent cannot
+reliably distinguish local disks from shared ones.
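+
+As an illustration of what the agent checks, an operator can inspect a
+device for these filesystem signatures by running the same ``lsblk`` query
+the agent issues internally (the device path below is only an example)::
+
+  lsblk -Pbia -oFSTYPE,UUID,PTUUID,PARTTYPE,PARTUUID /dev/sda
+
+A ``gfs2`` filesystem type, or a partition type GUID belonging to IBM GPFS
+or VMware VMFS, in this output is what triggers the safeguard.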
+
+By default, the agent will prevent these filesystems from being deleted and
+will halt the cleaning process when one is detected. The cleaning process can
+be re-triggered via Ironic's state machine once one of the documented
+settings has been used to notify the agent that no action is required.
+
+What filesystems are looked for
+-------------------------------
+
++-------------------------------------------+
+| IBM General Parallel Filesystem           |
++-------------------------------------------+
+| Red Hat Global File System 2              |
++-------------------------------------------+
+| VMware Virtual Machine File System (VMFS) |
++-------------------------------------------+
+
+I'm okay with deleting, how do I tell IPA to clean the disk(s)?
+---------------------------------------------------------------
+
+Four potential ways exist to signal this to IPA. Please note that all of
+these options require either access to the node in Ironic's API or the
+ability to modify Ironic's configuration.
+
+Via Ironic
+~~~~~~~~~~
+
+.. note:: This option requires a version of Ironic that is recent enough
+   to understand and explicitly provide this option to the Agent.
+
+Inform Ironic to provide the option to the Agent::
+
+  baremetal node set --driver-info wipe_special_filesystems=True
+
+Via a node's kernel_append_params setting
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This may be set on an individual node by utilizing the
+``kernel_append_params`` override setting. Example::
+
+  baremetal node set --driver-info kernel_append_params="ipa-guard-special-filesystems=False"
+
+Alternatively, if you wish to set this only once, you may use
+the ``instance_info`` field, which is wiped upon teardown of the node.
+Example::
+
+  baremetal node set --instance-info kernel_append_params="ipa-guard-special-filesystems=False"
+
+Via Ironic's Boot time PXE parameters (Globally)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Globally, this setting may be passed by modifying the ``ironic.conf``
+configuration file on your cluster and adding the
+``ipa-guard-special-filesystems=False`` string to the
+``[pxe]kernel_append_params`` parameter.
+
+.. warning::
+   If you're running a multi-conductor deployment, all of your
+   ``ironic.conf`` configuration files will need to be updated to match.
+
+Via Ramdisk configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This option requires modifying the ramdisk, and is the most complex, but may
+be advisable if you have a mixed-environment cluster where shared clustered
+filesystems may be a concern on some machines, but not others.
+
+.. warning::
+   This requires rebuilding your agent ramdisk and modifying the embedded
+   configuration file for the ironic-python-agent. If you're confused at all
+   by this statement, this option is not for you.
+
+Edit ``/etc/ironic_python_agent/ironic_python_agent.conf`` and set the
+parameter ``[DEFAULT]guard_special_filesystems`` to ``False``.
+
 References
 ==========
diff --git a/doc/source/contributor/hardware_managers.rst b/doc/source/contributor/hardware_managers.rst
index ee3d63576..f0f78f7d5 100644
--- a/doc/source/contributor/hardware_managers.rst
+++ b/doc/source/contributor/hardware_managers.rst
@@ -10,6 +10,13 @@ deploying your own hardware manager.
 
 IPA ships with :doc:`GenericHardwareManager `, which implements basic
 cleaning and deployment methods compatible with most hardware.
+.. warning::
+   Some functionality in the stock hardware manager cleaning methods, such
+   as clustered filesystem protections and cleaning method fallback logic,
+   may be useful in custom hardware managers, but it should not be assumed
+   to automatically carry over to custom managers. Custom hardware manager
+   maintainers should be mindful of this when overriding the stock methods.
+
 How are methods executed on HardwareManagers?
 ---------------------------------------------
 Methods that modify hardware are dispatched to each hardware manager in
@@ -149,6 +156,18 @@ function with extra parameters.
    be set to whichever manager has a higher hardware support level and then
    use the higher priority in the case of a tie.
 
+In some cases, it may be necessary to create a customized cleaning step to
+implement a particular pattern of behavior. Those doing such work may want
+to leverage the filesystem safety checks, which are part of the stock
+hardware managers.
+
+.. code-block:: python
+
+   def custom_erase_devices(self, node, ports):
+       for dev in determine_my_devices_to_erase():
+           hardware.safety_check_block_device(node, dev.name)
+           my_special_cleaning(dev.name)
+
 Custom HardwareManagers and Deploying
 -------------------------------------
 
diff --git a/ironic_python_agent/config.py b/ironic_python_agent/config.py
index 8901c45a2..9251a3e37 100644
--- a/ironic_python_agent/config.py
+++ b/ironic_python_agent/config.py
@@ -315,6 +315,17 @@ cli_opts = [
                 min=0, max=99,  # 100 is when IPA is booted
                 help='Priority of the inject_files deploy step (disabled '
                      'by default), an integer between 1 and .'),
+    cfg.BoolOpt('guard-special-filesystems',
+                default=APARAMS.get('ipa-guard-special-filesystems', True),
+                help='Guard "special" shared device filesystems from '
+                     'cleaning by the stock hardware manager\'s cleaning '
+                     'methods. If one of these filesystems is detected '
+                     'during cleaning, the cleaning process will be aborted '
+                     'and infrastructure operator intervention may be '
+                     'required, as this option is intended to prevent '
+                     'cleaning from inadvertently destroying a running '
+                     'cluster which may be visible over a storage fabric '
+                     'such as FibreChannel.'),
 ]
 
 CONF.register_cli_opts(cli_opts)
diff --git a/ironic_python_agent/errors.py b/ironic_python_agent/errors.py
index 1f97c0e2e..a0994207f 100644
--- a/ironic_python_agent/errors.py
+++ b/ironic_python_agent/errors.py
@@ -348,3 +348,22 @@ class HeartbeatConnectionError(IronicAPIError):
 
     def __init__(self, details):
         super(HeartbeatConnectionError, self).__init__(details)
+
+
+class ProtectedDeviceError(CleaningError):
+    """Error raised when a cleaning is halted due to a protected device."""
+
+    message = 'Protected device located, cleaning aborted.'
+
+    def __init__(self, device, what):
+        details = ('Protected %(what)s located on device %(device)s. '
+                   'This volume or contents may be a shared block device. '
+                   'Please consult your storage administrator, and restart '
+                   'cleaning after either detaching the volumes, or '
+                   'instructing IPA to not protect contents. See Ironic '
+                   'Python Agent documentation for more information.' %
+                   {'what': what,
+                    'device': device})
+
+        self.message = details
+        super(CleaningError, self).__init__(details)
diff --git a/ironic_python_agent/hardware.py b/ironic_python_agent/hardware.py
index 0f7e4f82d..dfcce6f80 100644
--- a/ironic_python_agent/hardware.py
+++ b/ironic_python_agent/hardware.py
@@ -924,6 +924,10 @@ class HardwareManager(object, metaclass=abc.ABCMeta):
 
         :param node: Ironic node object
         :param ports: list of Ironic port objects
+        :raises: ProtectedDeviceError if a device has been identified which
+                 may require manual intervention due to its contents and the
+                 operational risk involved, as it could also be a sign of an
+                 environmental misconfiguration.
         :return: a dictionary in the form {device.name: erasure output}
         """
         erase_results = {}
@@ -937,6 +941,7 @@ class HardwareManager(object, metaclass=abc.ABCMeta):
         thread_pool = ThreadPool(min(max_pool_size, len(block_devices)))
         for block_device in block_devices:
             params = {'node': node, 'block_device': block_device}
+            safety_check_block_device(node, block_device.name)
             erase_results[block_device.name] = thread_pool.apply_async(
                 dispatch_to_managers, ('erase_block_device',), params)
         thread_pool.close()
@@ -1541,6 +1546,10 @@ class GenericHardwareManager(HardwareManager):
         :param ports: list of Ironic port objects
         :raises BlockDeviceEraseError: when there's an error erasing the
                 block device
+        :raises: ProtectedDeviceError if a device has been identified which
+                 may require manual intervention due to its contents and the
+                 operational risk involved, as it could also be a sign of an
+                 environmental misconfiguration.
         """
         block_devices = self.list_block_devices(include_partitions=True)
         # NOTE(coreywright): Reverse sort by device name so a partition (eg
@@ -1549,6 +1558,7 @@ class GenericHardwareManager(HardwareManager):
         block_devices.sort(key=lambda dev: dev.name, reverse=True)
         erase_errors = {}
         for dev in self._list_erasable_devices():
+            safety_check_block_device(node, dev.name)
             try:
                 disk_utils.destroy_disk_metadata(dev.name, node['uuid'])
             except processutils.ProcessExecutionError as e:
@@ -1572,6 +1582,10 @@ class GenericHardwareManager(HardwareManager):
         :param ports: list of Ironic port objects
         :raises BlockDeviceEraseError: when there's an error erasing the
                 block device
+        :raises: ProtectedDeviceError if a device has been identified which
+                 may require manual intervention due to its contents and the
+                 operational risk involved, as it could also be a sign of an
+                 environmental misconfiguration.
         """
         erase_errors = {}
         info = node.get('driver_internal_info', {})
@@ -1579,6 +1593,7 @@ class GenericHardwareManager(HardwareManager):
             LOG.debug("No erasable devices have been found.")
             return
         for dev in self._list_erasable_devices():
+            safety_check_block_device(node, dev.name)
             try:
                 if self._is_nvme(dev):
                     execute_nvme_erase = info.get(
@@ -2957,3 +2972,109 @@ def get_multipath_status():
     # as if we directly try and work with the global var, we will be racing
     # tests endlessly.
     return MULTIPATH_ENABLED
+
+
+def safety_check_block_device(node, device):
+    """Performs safety checking of a block device before destroying.
+
+    In order to guard against destruction of file systems such as
+    shared-disk file systems
+    (https://en.wikipedia.org/wiki/Clustered_file_system#SHARED-DISK)
+    or similar filesystems where multiple distinct computers may have
+    unlocked concurrent IO access to the entire block device or
+    SAN Logical Unit Number, we need to evaluate the device and block
+    cleaning from occurring on these filesystems *unless* we have been
+    explicitly configured to do so.
+
+    This is because cleaning is an intentionally destructive operation,
+    and once started against such a device, given the complexities of
+    shared disk clustered filesystems where concurrent access is a design
+    element, in all likelihood the entire cluster can be negatively
+    impacted, and an operator will be forced to recover from snapshots
+    and/or backups of the volume's contents.
+
+    :param node: A node, or cached node object.
+    :param device: String representing the path to the block
+                   device to be checked.
+    :raises: ProtectedDeviceError when a device is identified with
+             one of these known clustered filesystems, and the overall
+             settings have not indicated for the agent to skip such
+             safety checks.
+    """
+
+    # NOTE(TheJulia): While this seems super rare, I found out after this
+    # thread of discussion started that I have customers who have done
+    # this and wiped out SAN volumes and their contents unintentionally
+    # as a result of these filesystems not being guarded.
+    # For those not familiar with shared disk clustered filesystems, think
+    # of it as impacting a Ceph cluster, except you're suddenly removing
+    # the underlying disks from the OSD, and the entire cluster
+    # goes down.
+
+    if not CONF.guard_special_filesystems:
+        return
+    di_info = node.get('driver_internal_info', {})
+    if not di_info.get('wipe_special_filesystems', True):
+        return
+    report, _e = il_utils.execute('lsblk', '-Pbia',
+                                  '-oFSTYPE,UUID,PTUUID,PARTTYPE,PARTUUID',
+                                  device)
+
+    lines = report.splitlines()
+
+    identified_fs_types = []
+    identified_ids = []
+    for line in lines:
+        # Split each line of lsblk output into KEY="VAL" pairs
+        vals = shlex.split(line)
+        if not vals:
+            continue
+        for key, val in (v.split('=', 1) for v in vals):
+            if key == 'FSTYPE':
+                identified_fs_types.append(val)
+            if key in ['UUID', 'PTUUID', 'PARTTYPE', 'PARTUUID']:
+                identified_ids.append(val)
+        # Ignore block types not specified
+
+    _check_for_special_partitions_filesystems(
+        device,
+        identified_ids,
+        identified_fs_types)
+
+
+def _check_for_special_partitions_filesystems(device, ids, fs_types):
+    """Compare supplied IDs and types to known items, and raise if found.
+
+    :param device: The block device in use, specifically for logging.
+    :param ids: A list of IDs found to check.
+    :param fs_types: A list of FS types found to check.
+    :raises: ProtectedDeviceError should a partition label or metadata
+             be discovered which suggests a shared disk clustered
+             filesystem is present.
+    """
+
+    guarded_ids = {
+        # Apparently GPFS can use shared volumes....
+ '37AFFC90-EF7D-4E96-91C3-2D7AE055B174': 'IBM GPFS Partition', + # Shared volume parallel filesystem + 'AA31E02A-400F-11DB-9590-000C2911D1B8': 'VMware VMFS Partition (GPT)', + '0xfb': 'VMware VMFS Partition (MBR)', + } + for key, value in guarded_ids.items(): + for id_value in ids: + if key == id_value: + raise errors.ProtectedDeviceError( + device=device, + what=value) + + guarded_fs_types = { + 'gfs2': 'Red Hat Global File System 2', + } + + for key, value in guarded_fs_types.items(): + for fs in fs_types: + if key == fs: + raise errors.ProtectedDeviceError( + device=device, + what=value) diff --git a/ironic_python_agent/tests/unit/test_hardware.py b/ironic_python_agent/tests/unit/test_hardware.py index a610eb2f8..213490505 100644 --- a/ironic_python_agent/tests/unit/test_hardware.py +++ b/ironic_python_agent/tests/unit/test_hardware.py @@ -1710,10 +1710,12 @@ class TestGenericHardwareManager(base.IronicAgentTest): mocked_listdir.assert_has_calls(expected_calls) mocked_mpath.assert_called_once_with() + @mock.patch.object(hardware, 'safety_check_block_device', autospec=True) @mock.patch.object(hardware, 'ThreadPool', autospec=True) @mock.patch.object(hardware, 'dispatch_to_managers', autospec=True) def test_erase_devices_no_parallel_by_default(self, mocked_dispatch, - mock_threadpool): + mock_threadpool, + mock_safety_check): # NOTE(TheJulia): This test was previously more elaborate and # had a high failure rate on py37 and py38. So instead, lets just @@ -1732,10 +1734,43 @@ class TestGenericHardwareManager(base.IronicAgentTest): calls = [mock.call(1)] self.hardware.erase_devices({}, []) mock_threadpool.assert_has_calls(calls) + mock_safety_check.assert_has_calls([ + mock.call({}, '/dev/sdj'), + mock.call({}, '/dev/hdaa') + ]) + @mock.patch.object(hardware, 'safety_check_block_device', autospec=True) + @mock.patch.object(hardware, 'ThreadPool', autospec=True) + @mock.patch.object(hardware, 'dispatch_to_managers', autospec=True) + def test_erase_devices_no_parallel_by_default_protected_device( + self, mocked_dispatch, + mock_threadpool, + mock_safety_check): + mock_safety_check.side_effect = errors.ProtectedDeviceError( + device='foo', + what='bar') + + self.hardware.list_block_devices = mock.Mock() + + self.hardware.list_block_devices.return_value = [ + hardware.BlockDevice('/dev/sdj', 'big', 1073741824, True), + hardware.BlockDevice('/dev/hdaa', 'small', 65535, False), + ] + + calls = [mock.call(1)] + self.assertRaises(errors.ProtectedDeviceError, + self.hardware.erase_devices, {}, []) + mock_threadpool.assert_has_calls(calls) + mock_safety_check.assert_has_calls([ + mock.call({}, '/dev/sdj'), + ]) + mock_threadpool.apply_async.assert_not_called() + + @mock.patch.object(hardware, 'safety_check_block_device', autospec=True) @mock.patch('multiprocessing.pool.ThreadPool.apply_async', autospec=True) @mock.patch.object(hardware, 'dispatch_to_managers', autospec=True) - def test_erase_devices_concurrency(self, mocked_dispatch, mocked_async): + def test_erase_devices_concurrency(self, mocked_dispatch, mocked_async, + mock_safety_check): internal_info = self.node['driver_internal_info'] internal_info['disk_erasure_concurrency'] = 10 mocked_dispatch.return_value = 'erased device' @@ -1760,9 +1795,15 @@ class TestGenericHardwareManager(base.IronicAgentTest): for dev in (blkdev1, blkdev2)] mocked_async.assert_has_calls(calls) self.assertEqual(expected, result) + mock_safety_check.assert_has_calls([ + mock.call(self.node, '/dev/sdj'), + mock.call(self.node, '/dev/hdaa'), + ]) + 
@mock.patch.object(hardware, 'safety_check_block_device', autospec=True) @mock.patch.object(hardware, 'ThreadPool', autospec=True) - def test_erase_devices_concurrency_pool_size(self, mocked_pool): + def test_erase_devices_concurrency_pool_size(self, mocked_pool, + mock_safety_check): self.hardware.list_block_devices = mock.Mock() self.hardware.list_block_devices.return_value = [ hardware.BlockDevice('/dev/sdj', 'big', 1073741824, True), @@ -1782,6 +1823,10 @@ class TestGenericHardwareManager(base.IronicAgentTest): self.hardware.erase_devices(self.node, []) mocked_pool.assert_called_with(1) + mock_safety_check.assert_has_calls([ + mock.call(self.node, '/dev/sdj'), + mock.call(self.node, '/dev/hdaa'), + ]) @mock.patch.object(hardware, 'dispatch_to_managers', autospec=True) def test_erase_devices_without_disk(self, mocked_dispatch): @@ -2441,6 +2486,7 @@ class TestGenericHardwareManager(base.IronicAgentTest): mock.call('/sys/fs/pstore/' + arg) for arg in pstore_entries ]) + @mock.patch.object(hardware, 'safety_check_block_device', autospec=True) @mock.patch.object(il_utils, 'execute', autospec=True) @mock.patch.object(disk_utils, 'destroy_disk_metadata', autospec=True) @mock.patch.object(hardware.GenericHardwareManager, @@ -2449,7 +2495,7 @@ class TestGenericHardwareManager(base.IronicAgentTest): '_list_erasable_devices', autospec=True) def test_erase_devices_express( self, mock_list_erasable_devices, mock_nvme_erase, - mock_destroy_disk_metadata, mock_execute): + mock_destroy_disk_metadata, mock_execute, mock_safety_check): block_devices = [ hardware.BlockDevice('/dev/sda', 'sata', 65535, False), hardware.BlockDevice('/dev/md0', 'raid-device', 32767, False), @@ -2466,7 +2512,44 @@ class TestGenericHardwareManager(base.IronicAgentTest): mock.call('/dev/md0', self.node['uuid'])], mock_destroy_disk_metadata.call_args_list) mock_list_erasable_devices.assert_called_with(self.hardware) + mock_safety_check.assert_has_calls([ + mock.call(self.node, '/dev/sda'), + mock.call(self.node, '/dev/md0'), + mock.call(self.node, '/dev/nvme0n1'), + mock.call(self.node, '/dev/nvme1n1') + ]) + @mock.patch.object(hardware, 'safety_check_block_device', autospec=True) + @mock.patch.object(il_utils, 'execute', autospec=True) + @mock.patch.object(disk_utils, 'destroy_disk_metadata', autospec=True) + @mock.patch.object(hardware.GenericHardwareManager, + '_nvme_erase', autospec=True) + @mock.patch.object(hardware.GenericHardwareManager, + '_list_erasable_devices', autospec=True) + def test_erase_devices_express_stops_on_safety_failure( + self, mock_list_erasable_devices, mock_nvme_erase, + mock_destroy_disk_metadata, mock_execute, mock_safety_check): + mock_safety_check.side_effect = errors.ProtectedDeviceError( + device='foo', + what='bar') + block_devices = [ + hardware.BlockDevice('/dev/sda', 'sata', 65535, False), + hardware.BlockDevice('/dev/md0', 'raid-device', 32767, False), + hardware.BlockDevice('/dev/nvme0n1', 'nvme', 32767, False), + hardware.BlockDevice('/dev/nvme1n1', 'nvme', 32767, False) + ] + mock_list_erasable_devices.return_value = list(block_devices) + + self.assertRaises(errors.ProtectedDeviceError, + self.hardware.erase_devices_express, self.node, []) + mock_nvme_erase.assert_not_called() + mock_destroy_disk_metadata.assert_not_called() + mock_list_erasable_devices.assert_called_with(self.hardware) + mock_safety_check.assert_has_calls([ + mock.call(self.node, '/dev/sda') + ]) + + @mock.patch.object(hardware, 'safety_check_block_device', autospec=True) @mock.patch.object(il_utils, 'execute', 
autospec=True) @mock.patch.object(hardware.GenericHardwareManager, '_is_virtual_media_device', autospec=True) @@ -2475,7 +2558,7 @@ class TestGenericHardwareManager(base.IronicAgentTest): @mock.patch.object(disk_utils, 'destroy_disk_metadata', autospec=True) def test_erase_devices_metadata( self, mock_metadata, mock_list_devs, mock__is_vmedia, - mock_execute): + mock_execute, mock_safety_check): block_devices = [ hardware.BlockDevice('/dev/sr0', 'vmedia', 12345, True), hardware.BlockDevice('/dev/sdb2', 'raid-member', 32767, False), @@ -2509,7 +2592,62 @@ class TestGenericHardwareManager(base.IronicAgentTest): mock.call(self.hardware, block_devices[2]), mock.call(self.hardware, block_devices[5])], mock__is_vmedia.call_args_list) + mock_safety_check.assert_has_calls([ + mock.call(self.node, '/dev/sda1'), + mock.call(self.node, '/dev/sda'), + # This is kind of redundant code/pattern behavior wise + # but you never know what someone has done... + mock.call(self.node, '/dev/md0') + ]) + @mock.patch.object(hardware, 'safety_check_block_device', autospec=True) + @mock.patch.object(il_utils, 'execute', autospec=True) + @mock.patch.object(hardware.GenericHardwareManager, + '_is_virtual_media_device', autospec=True) + @mock.patch.object(hardware.GenericHardwareManager, + 'list_block_devices', autospec=True) + @mock.patch.object(disk_utils, 'destroy_disk_metadata', autospec=True) + def test_erase_devices_metadata_safety_check( + self, mock_metadata, mock_list_devs, mock__is_vmedia, + mock_execute, mock_safety_check): + block_devices = [ + hardware.BlockDevice('/dev/sr0', 'vmedia', 12345, True), + hardware.BlockDevice('/dev/sdb2', 'raid-member', 32767, False), + hardware.BlockDevice('/dev/sda', 'small', 65535, False), + hardware.BlockDevice('/dev/sda1', '', 32767, False), + hardware.BlockDevice('/dev/sda2', 'raid-member', 32767, False), + hardware.BlockDevice('/dev/md0', 'raid-device', 32767, False) + ] + # NOTE(coreywright): Don't return the list, but a copy of it, because + # we depend on its elements' order when referencing it later during + # verification, but the method under test sorts the list changing it. 
+ mock_list_devs.return_value = list(block_devices) + mock__is_vmedia.side_effect = lambda _, dev: dev.name == '/dev/sr0' + mock_execute.side_effect = [ + ('sdb2 linux_raid_member host:1 f9978968', ''), + ('sda2 linux_raid_member host:1 f9978969', ''), + ('sda1', ''), ('sda', ''), ('md0', '')] + mock_safety_check.side_effect = [ + None, + errors.ProtectedDeviceError( + device='foo', + what='bar') + ] + + self.assertRaises(errors.ProtectedDeviceError, + self.hardware.erase_devices_metadata, + self.node, []) + + self.assertEqual([mock.call('/dev/sda1', self.node['uuid'])], + mock_metadata.call_args_list) + mock_list_devs.assert_called_with(self.hardware, + include_partitions=True) + mock_safety_check.assert_has_calls([ + mock.call(self.node, '/dev/sda1'), + mock.call(self.node, '/dev/sda'), + ]) + + @mock.patch.object(hardware, 'safety_check_block_device', autospec=True) @mock.patch.object(hardware.GenericHardwareManager, '_is_linux_raid_member', autospec=True) @mock.patch.object(hardware.GenericHardwareManager, @@ -2519,7 +2657,7 @@ class TestGenericHardwareManager(base.IronicAgentTest): @mock.patch.object(disk_utils, 'destroy_disk_metadata', autospec=True) def test_erase_devices_metadata_error( self, mock_metadata, mock_list_devs, mock__is_vmedia, - mock__is_raid_member): + mock__is_raid_member, mock_safety_check): block_devices = [ hardware.BlockDevice('/dev/sda', 'small', 65535, False), hardware.BlockDevice('/dev/sdb', 'big', 10737418240, True), @@ -2553,6 +2691,10 @@ class TestGenericHardwareManager(base.IronicAgentTest): self.assertEqual([mock.call(self.hardware, block_devices[1]), mock.call(self.hardware, block_devices[0])], mock__is_vmedia.call_args_list) + mock_safety_check.assert_has_calls([ + mock.call(self.node, '/dev/sdb'), + mock.call(self.node, '/dev/sda') + ]) @mock.patch.object(il_utils, 'execute', autospec=True) def test__is_linux_raid_member(self, mocked_execute): @@ -4991,3 +5133,51 @@ class TestAPIClientSaveAndUse(base.IronicAgentTest): calls = [mock.call('list_hardware_info'), mock.call('wait_for_disks')] mock_dispatch.assert_has_calls(calls) + + +@mock.patch.object(il_utils, 'execute', autospec=True) +class TestProtectedDiskSafetyChecks(base.IronicAgentTest): + + def test_special_filesystem_guard_not_enabled(self, mock_execute): + CONF.set_override('guard_special_filesystems', False) + hardware.safety_check_block_device({}, '/dev/foo') + mock_execute.assert_not_called() + + def test_special_filesystem_guard_node_indicates_skip(self, mock_execute): + node = { + 'driver_internal_info': { + 'wipe_special_filesystems': False + } + } + mock_execute.return_value = ('', '') + hardware.safety_check_block_device(node, '/dev/foo') + mock_execute.assert_not_called() + + def test_special_filesystem_guard_enabled_no_results(self, mock_execute): + mock_execute.return_value = ('', '') + hardware.safety_check_block_device({}, '/dev/foo') + + def test_special_filesystem_guard_raises(self, mock_execute): + GFS2 = 'FSTYPE="gfs2"\n' + GPFS1 = 'UUID="37AFFC90-EF7D-4E96-91C3-2D7AE055B174"\n' + GPFS2 = 'PTUUID="37AFFC90-EF7D-4E96-91C3-2D7AE055B174"\n' + GPFS3 = 'PARTTYPE="37AFFC90-EF7D-4E96-91C3-2D7AE055B174"\n' + GPFS4 = 'PARTUUID="37AFFC90-EF7D-4E96-91C3-2D7AE055B174"\n' + VMFS1 = 'UUID="AA31E02A-400F-11DB-9590-000C2911D1B8"\n' + VMFS2 = 'UUID="AA31E02A-400F-11DB-9590-000C2911D1B8"\n' + VMFS3 = 'UUID="AA31E02A-400F-11DB-9590-000C2911D1B8"\n' + VMFS4 = 'UUID="AA31E02A-400F-11DB-9590-000C2911D1B8"\n' + VMFS5 = 'UUID="0xfb"\n' + VMFS6 = 'PTUUID="0xfb"\n' + VMFS7 = 'PARTTYPE="0xfb"\n' + VMFS8 
= 'PARTUUID="0xfb"\n'
+
+        expected_failures = [GFS2, GPFS1, GPFS2, GPFS3, GPFS4, VMFS1, VMFS2,
+                             VMFS3, VMFS4, VMFS5, VMFS6, VMFS7, VMFS8]
+        for failure in expected_failures:
+            mock_execute.reset_mock()
+            mock_execute.return_value = (failure, '')
+            self.assertRaises(errors.ProtectedDeviceError,
+                              hardware.safety_check_block_device,
+                              {}, '/dev/foo')
+            self.assertEqual(1, mock_execute.call_count)
diff --git a/releasenotes/notes/prevent-deletion-of-shared-disk-filesystems-4c17c7666d2fe3bc.yaml b/releasenotes/notes/prevent-deletion-of-shared-disk-filesystems-4c17c7666d2fe3bc.yaml
new file mode 100644
index 000000000..c29b31218
--- /dev/null
+++ b/releasenotes/notes/prevent-deletion-of-shared-disk-filesystems-4c17c7666d2fe3bc.yaml
@@ -0,0 +1,20 @@
+---
+fixes:
+  - |
+    Previously, when the ``ironic-python-agent`` would undergo erasure of
+    block devices during cleaning, it would automatically attempt to erase
+    the contents of any "Shared Device" clustered filesystems which may be
+    in use by multiple distinct machines over a storage fabric. In
+    particular, IBM GPFS, Red Hat Global File System 2, and VMware Virtual
+    Machine File System (VMFS) are now identified, and cleaning is halted.
+    This is important because, should an active cluster be using this disk,
+    cleaning could potentially cause the cluster to go down, forcing
+    restoration from backups. Ideally, infrastructure operators should check
+    their environment's storage configuration and un-map any clustered
+    filesystems from being visible to Ironic nodes, unless explicitly needed
+    and expected. Please see the Ironic-Python-Agent troubleshooting
+    documentation for more information.
+issues:
+  - |
+    Logic to guard VMFS filesystems from being destroyed *may* not recognize
+    VMFS extents. Operators with examples of partitioning for extent usage
+    are encouraged to contact the Ironic community.