Merge "Guard shared device/cluster filesystems"
@@ -103,3 +103,23 @@ Clean steps
Delete the RAID configuration. This step belongs to the ``raid`` interface
and must be used through the :ironic-doc:`ironic RAID feature
<admin/raid.html>`.

Cleaning safeguards
-------------------

The stock hardware manager contains a number of safeguards to prevent
unsafe conditions from occurring.

Shared Disk Cluster Filesystems
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When a commonly used shared-disk cluster filesystem is detected, cleaning
processes in stock hardware manager methods abort before destroying the
contents of the disk.

These filesystems include IBM General Parallel File System (GPFS),
VMware Virtual Machine File System (VMFS), and Red Hat Global File System
(GFS2).

For information on troubleshooting and disabling this check,
see :doc:`/admin/troubleshooting`.
@@ -24,7 +24,7 @@ image boots. Kernel command line parameters are used to do this.
dynamic-login element example:

-- Add ``sshkey="ssh-rsa BBA1..."`` to pxe_append_params setting in
+- Add ``sshkey="ssh-rsa BBA1..."`` to kernel_append_params setting in
  the ``ironic.conf`` file
- Restart the ironic-conductor with the command
  ``service ironic-conductor restart``
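
A minimal sketch of the resulting ``ironic.conf`` entry (the surrounding
parameters are illustrative, not required)::

  [pxe]
  kernel_append_params = nofb nomodeset sshkey="ssh-rsa BBA1..."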
@@ -76,7 +76,7 @@ are used to do this.
dynamic-login element example::

    Generate an ENCRYPTED_PASSWORD with the openssl passwd -1 command
-   Add rootpwd="$ENCRYPTED_PASSWORD" value on the pxe_append_params setting in /etc/ironic/ironic.conf
+   Add rootpwd="$ENCRYPTED_PASSWORD" value on the kernel_append_params setting in /etc/ironic/ironic.conf
    Restart the ironic-conductor with the command service ironic-conductor restart
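
A minimal shell sketch of those steps (assumes a sysvinit-style service, as
in the example above)::

    ENCRYPTED_PASSWORD=$(openssl passwd -1)
    # Append rootpwd="$ENCRYPTED_PASSWORD" to the kernel_append_params
    # setting in /etc/ironic/ironic.conf, then restart the conductor:
    sudo service ironic-conductor restart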

Users can also be added to DIB built IPA images with the devuser element [1]_
@@ -118,7 +118,7 @@ Set IPA to debug logging
Debug logging can be enabled in several different ways. The easiest way is to
add ``ipa-debug=1`` to the kernel command line. To do this:

-- Append ``ipa-debug=1`` to the pxe_append_params setting in the
+- Append ``ipa-debug=1`` to the kernel_append_params setting in the
  ``ironic.conf`` file
- Restart the ironic-conductor with the command
  ``service ironic-conductor restart``
@@ -161,6 +161,96 @@ If the system uses systemd then systemctl can be used to restart the service::

    sudo systemctl restart ironic-python-agent.service

Cleaning halted with ProtectedDeviceError
=========================================

The IPA service halts cleaning when one of the block devices within, or
attached to, the bare metal node contains a class of filesystem which **MAY**
cause irreparable harm to a potentially running cluster if accidentally
removed.

These filesystems *may* be used only for local storage, and as a result may
be safe to erase. However, a shared block device may be in use, such as a
device supplied via a Storage Area Network utilizing protocols such as iSCSI
or FibreChannel. Ultimately, the Host Bus Adapter (HBA) may not be an
entirely "detectable" entity given the hardware marketplace and aspects such
as "SmartNICs" and Converged Network Adapters with specific offload functions
to support standards like "NVMe over Fabrics" (NVMe-oF).

By default, the agent will prevent these filesystems from being deleted and
will halt the cleaning process when they are detected. The cleaning process
can be re-triggered via Ironic's state machine once one of the documented
settings has been used to notify the agent that no action is required.
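
One way to re-trigger cleaning, after applying one of the settings below, is
to cycle the node through the state machine (node name illustrative)::

  baremetal node manage node-0
  baremetal node provide node-0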

What filesystems are looked for
-------------------------------

+-------------------------------------------+
| IBM General Parallel File System (GPFS)   |
+-------------------------------------------+
| Red Hat Global File System 2 (GFS2)       |
+-------------------------------------------+
| VMware Virtual Machine File System (VMFS) |
+-------------------------------------------+

I'm okay with deleting, how do I tell IPA to clean the disk(s)?
---------------------------------------------------------------

Four potential ways exist to signal to IPA. Please note, all of these options
require either access to the node in Ironic's API or the ability to modify
Ironic's configuration.

Via Ironic
~~~~~~~~~~

.. note:: This option requires that the version of Ironic be recent enough
   to understand and explicitly provide this option to the Agent.

Inform Ironic to provide the option to the Agent::

  baremetal node set <node> --driver-info wipe_special_filesystems=True

Via a node's kernel_append_params setting
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This may be set on a per-node level by utilizing the ``kernel_append_params``
override setting. Example::

  baremetal node set <node> --driver-info kernel_append_params="ipa-guard-special-filesystems=False"

Alternatively, if you wish to set this only once, you may use
the ``instance_info`` field, which is wiped upon teardown of the node.
Example::

  baremetal node set <node> --instance-info kernel_append_params="ipa-guard-special-filesystems=False"

Via Ironic's Boot time PXE parameters (Globally)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Globally, this setting may be passed by modifying the ``ironic.conf``
configuration file on your cluster, adding the
``ipa-guard-special-filesystems=False`` string to the
``[pxe]kernel_append_params`` parameter.
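
A minimal sketch of the resulting configuration (any other parameters on the
line are illustrative)::

  [pxe]
  kernel_append_params = nofb nomodeset ipa-guard-special-filesystems=False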

.. warning::
   If you're running a multi-conductor deployment, all of your ``ironic.conf``
   configuration files will need to be updated to match.

Via Ramdisk configuration
~~~~~~~~~~~~~~~~~~~~~~~~~

This option requires modifying the ramdisk, and is the most complex, but may
be advisable if you have a mixed-environment cluster where shared clustered
filesystems may be a concern on some machines, but not others.

.. warning::
   This requires rebuilding your agent ramdisk and modifying the embedded
   configuration file for the ironic-python-agent. If you are at all confused
   by this statement, this option is not for you.

Edit ``/etc/ironic_python_agent/ironic_python_agent.conf`` and set the
parameter ``[DEFAULT]guard_special_filesystems`` to ``False``.
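
For example, the embedded configuration file would contain::

  [DEFAULT]
  guard_special_filesystems = False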

References
==========
@@ -10,6 +10,13 @@ deploying your own hardware manager.
IPA ships with :doc:`GenericHardwareManager </admin/hardware_managers>`, which
implements basic cleaning and deployment methods compatible with most hardware.

.. warning::
   Some functionality inherent in the stock hardware manager cleaning methods
   may be useful in custom hardware managers, but should not be expected to
   work automatically in them. Examples include clustered filesystem
   protections and cleaning method fallback logic. Custom hardware manager
   maintainers should be mindful when overriding the stock methods.

How are methods executed on HardwareManagers?
---------------------------------------------

Methods that modify hardware are dispatched to each hardware manager in
@@ -149,6 +156,18 @@ function with extra parameters.
be set to whichever manager has a higher hardware support level and then
use the higher priority in the case of a tie.

In some cases, it may be necessary to create a customized cleaning step to
implement a particular pattern of behavior. Those doing such work may want to
leverage the filesystem safety checks which are part of the stock hardware
managers.

.. code-block:: python

    def custom_erase_devices(self, node, ports):
        for dev in determine_my_devices_to_erase():
            # Halts cleaning with ProtectedDeviceError if a guarded
            # clustered filesystem is detected on the device.
            hardware.safety_check_block_device(node, dev.name)
            my_special_cleaning(dev.name)
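
The check raises ``ProtectedDeviceError`` when a guarded filesystem is found,
which halts the cleaning step; the custom step does not need to handle the
exception itself.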

Custom HardwareManagers and Deploying
-------------------------------------
@@ -315,6 +315,17 @@ cli_opts = [
                min=0, max=99,  # 100 is when IPA is booted
                help='Priority of the inject_files deploy step (disabled '
                     'by default), an integer between 1 and 99.'),
    cfg.BoolOpt('guard-special-filesystems',
                default=APARAMS.get('ipa-guard-special-filesystems', True),
                help='Guard "special" shared device filesystems from '
                     'cleaning by the stock hardware manager\'s cleaning '
                     'methods. If one of these filesystems is detected '
                     'during cleaning, the cleaning process will be aborted '
                     'and infrastructure operator intervention may be '
                     'required, as this option is intended to prevent '
                     'cleaning from inadvertently destroying a running '
                     'cluster which may be visible over a storage fabric '
                     'such as FibreChannel.'),
]

CONF.register_cli_opts(cli_opts)
@@ -348,3 +348,22 @@ class HeartbeatConnectionError(IronicAPIError):

    def __init__(self, details):
        super(HeartbeatConnectionError, self).__init__(details)


class ProtectedDeviceError(CleaningError):
    """Error raised when cleaning is halted due to a protected device."""

    message = 'Protected device located, cleaning aborted.'

    def __init__(self, device, what):
        details = ('Protected %(what)s located on device %(device)s. '
                   'This volume or its contents may be a shared block '
                   'device. Please consult your storage administrator, '
                   'and restart cleaning after either detaching the '
                   'volumes or instructing IPA to not protect contents. '
                   'See the Ironic Python Agent documentation for more '
                   'information.' %
                   {'what': what,
                    'device': device})

        self.message = details
        super(ProtectedDeviceError, self).__init__(details)
@@ -924,6 +924,10 @@ class HardwareManager(object, metaclass=abc.ABCMeta):

        :param node: Ironic node object
        :param ports: list of Ironic port objects
        :raises: ProtectedDeviceError if a device has been identified which
                 may require manual intervention due to the contents and
                 operational risk, as it could also be a sign
                 of an environmental misconfiguration.
        :return: a dictionary in the form {device.name: erasure output}
        """
        erase_results = {}
@@ -937,6 +941,7 @@ class HardwareManager(object, metaclass=abc.ABCMeta):
        thread_pool = ThreadPool(min(max_pool_size, len(block_devices)))
        for block_device in block_devices:
            params = {'node': node, 'block_device': block_device}
            safety_check_block_device(node, block_device.name)
            erase_results[block_device.name] = thread_pool.apply_async(
                dispatch_to_managers, ('erase_block_device',), params)
        thread_pool.close()
@@ -1541,6 +1546,10 @@ class GenericHardwareManager(HardwareManager):
        :param ports: list of Ironic port objects
        :raises BlockDeviceEraseError: when there's an error erasing the
                block device
        :raises: ProtectedDeviceError if a device has been identified which
                 may require manual intervention due to the contents and
                 operational risk, as it could also be a sign
                 of an environmental misconfiguration.
        """
        block_devices = self.list_block_devices(include_partitions=True)
        # NOTE(coreywright): Reverse sort by device name so a partition (eg
@@ -1549,6 +1558,7 @@ class GenericHardwareManager(HardwareManager):
        block_devices.sort(key=lambda dev: dev.name, reverse=True)
        erase_errors = {}
        for dev in self._list_erasable_devices():
            safety_check_block_device(node, dev.name)
            try:
                disk_utils.destroy_disk_metadata(dev.name, node['uuid'])
            except processutils.ProcessExecutionError as e:
@@ -1572,6 +1582,10 @@ class GenericHardwareManager(HardwareManager):
        :param ports: list of Ironic port objects
        :raises BlockDeviceEraseError: when there's an error erasing the
                block device
        :raises: ProtectedDeviceError if a device has been identified which
                 may require manual intervention due to the contents and
                 operational risk, as it could also be a sign
                 of an environmental misconfiguration.
        """
        erase_errors = {}
        info = node.get('driver_internal_info', {})
@@ -1579,6 +1593,7 @@ class GenericHardwareManager(HardwareManager):
            LOG.debug("No erasable devices have been found.")
            return
        for dev in self._list_erasable_devices():
            safety_check_block_device(node, dev.name)
            try:
                if self._is_nvme(dev):
                    execute_nvme_erase = info.get(
@@ -2957,3 +2972,109 @@ def get_multipath_status():
    # as if we directly try and work with the global var, we will be racing
    # tests endlessly.
    return MULTIPATH_ENABLED


def safety_check_block_device(node, device):
    """Performs safety checking of a block device before destroying.

    In order to guard against destruction of file systems such as
    shared-disk file systems
    (https://en.wikipedia.org/wiki/Clustered_file_system#SHARED-DISK)
    or similar filesystems where multiple distinct computers may have
    unlocked concurrent IO access to the entire block device or
    SAN Logical Unit Number, we need to evaluate and block cleaning
    from occurring on these filesystems *unless* we have been explicitly
    configured to do so.

    This is because cleaning is an intentionally destructive operation,
    and once started against such a device, given the complexities of
    shared disk clustered filesystems where concurrent access is a design
    element, in all likelihood the entire cluster can be negatively
    impacted, and an operator will be forced to recover from snapshots
    and/or backups of the volume's contents.

    :param node: A node, or cached node object.
    :param device: String representing the path to the block
                   device to be checked.
    :raises: ProtectedDeviceError when a device is identified with
             one of these known clustered filesystems, and the overall
             settings have not indicated for the agent to skip such
             safety checks.
    """
    # NOTE(TheJulia): While this seems super rare, I found out after this
    # thread of discussion started that I have customers which have done
    # this and wiped out SAN volumes and their contents unintentionally
    # as a result of these filesystems not being guarded.
    # For those not familiar with shared disk clustered filesystems, think
    # of it as though you're impacting a Ceph cluster, except you're suddenly
    # removing the underlying disks from the OSD, and the entire cluster
    # goes down.

    if not CONF.guard_special_filesystems:
        return
    di_info = node.get('driver_internal_info', {})
    if not di_info.get('wipe_special_filesystems', True):
        return
    report, _e = il_utils.execute('lsblk', '-Pbia',
                                  '-oFSTYPE,UUID,PTUUID,PARTTYPE,PARTUUID',
                                  device)

    lines = report.splitlines()
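    # Illustrative sample of one lsblk -Pbia report line (values assumed,
    # not taken from the source):
    #   FSTYPE="gfs2" UUID="" PTUUID="" PARTTYPE="" PARTUUID=""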
    identified_fs_types = []
    identified_ids = []
    for line in lines:
        # Split into KEY=VAL pairs
        vals = shlex.split(line)
        if not vals:
            continue
        for key, val in (v.split('=', 1) for v in vals):
            if key == 'FSTYPE':
                identified_fs_types.append(val)
            if key in ['UUID', 'PTUUID', 'PARTTYPE', 'PARTUUID']:
                identified_ids.append(val)
        # Ignore block types not specified

    _check_for_special_partitions_filesystems(
        device,
        identified_ids,
        identified_fs_types)
def _check_for_special_partitions_filesystems(device, ids, fs_types):
    """Compare supplied IDs and types to known items, and raise if found.

    :param device: The block device in use, specifically for logging.
    :param ids: A list of the IDs found to check.
    :param fs_types: A list of FS types found to check.
    :raises: ProtectedDeviceError should a partition label or metadata
             be discovered which suggests a shared disk clustered filesystem
             has been discovered.
    """

    guarded_ids = {
        # Apparently GPFS can use shared volumes....
        '37AFFC90-EF7D-4E96-91C3-2D7AE055B174': 'IBM GPFS Partition',
        # Shared volume parallel filesystem
        'AA31E02A-400F-11DB-9590-000C2911D1B8': 'VMware VMFS Partition (GPT)',
        '0xfb': 'VMware VMFS Partition (MBR)',
    }
    for key, value in guarded_ids.items():
        for id_value in ids:
            if key == id_value:
                raise errors.ProtectedDeviceError(
                    device=device,
                    what=value)

    guarded_fs_types = {
        'gfs2': 'Red Hat Global File System 2',
    }

    for key, value in guarded_fs_types.items():
        for fs in fs_types:
            if key == fs:
                raise errors.ProtectedDeviceError(
                    device=device,
                    what=value)
@@ -1710,10 +1710,12 @@ class TestGenericHardwareManager(base.IronicAgentTest):
        mocked_listdir.assert_has_calls(expected_calls)
        mocked_mpath.assert_called_once_with()

    @mock.patch.object(hardware, 'safety_check_block_device', autospec=True)
    @mock.patch.object(hardware, 'ThreadPool', autospec=True)
    @mock.patch.object(hardware, 'dispatch_to_managers', autospec=True)
    def test_erase_devices_no_parallel_by_default(self, mocked_dispatch,
-                                                 mock_threadpool):
+                                                 mock_threadpool,
+                                                 mock_safety_check):

        # NOTE(TheJulia): This test was previously more elaborate and
        # had a high failure rate on py37 and py38. So instead, let's just
@@ -1732,10 +1734,43 @@ class TestGenericHardwareManager(base.IronicAgentTest):
        calls = [mock.call(1)]
        self.hardware.erase_devices({}, [])
        mock_threadpool.assert_has_calls(calls)
        mock_safety_check.assert_has_calls([
            mock.call({}, '/dev/sdj'),
            mock.call({}, '/dev/hdaa')
        ])

    @mock.patch.object(hardware, 'safety_check_block_device', autospec=True)
    @mock.patch.object(hardware, 'ThreadPool', autospec=True)
    @mock.patch.object(hardware, 'dispatch_to_managers', autospec=True)
    def test_erase_devices_no_parallel_by_default_protected_device(
            self, mocked_dispatch,
            mock_threadpool,
            mock_safety_check):
        mock_safety_check.side_effect = errors.ProtectedDeviceError(
            device='foo',
            what='bar')

        self.hardware.list_block_devices = mock.Mock()

        self.hardware.list_block_devices.return_value = [
            hardware.BlockDevice('/dev/sdj', 'big', 1073741824, True),
            hardware.BlockDevice('/dev/hdaa', 'small', 65535, False),
        ]

        calls = [mock.call(1)]
        self.assertRaises(errors.ProtectedDeviceError,
                          self.hardware.erase_devices, {}, [])
        mock_threadpool.assert_has_calls(calls)
        mock_safety_check.assert_has_calls([
            mock.call({}, '/dev/sdj'),
        ])
        mock_threadpool.apply_async.assert_not_called()

    @mock.patch.object(hardware, 'safety_check_block_device', autospec=True)
    @mock.patch('multiprocessing.pool.ThreadPool.apply_async', autospec=True)
    @mock.patch.object(hardware, 'dispatch_to_managers', autospec=True)
-   def test_erase_devices_concurrency(self, mocked_dispatch, mocked_async):
+   def test_erase_devices_concurrency(self, mocked_dispatch, mocked_async,
+                                      mock_safety_check):
        internal_info = self.node['driver_internal_info']
        internal_info['disk_erasure_concurrency'] = 10
        mocked_dispatch.return_value = 'erased device'
@@ -1760,9 +1795,15 @@ class TestGenericHardwareManager(base.IronicAgentTest):
                 for dev in (blkdev1, blkdev2)]
        mocked_async.assert_has_calls(calls)
        self.assertEqual(expected, result)
        mock_safety_check.assert_has_calls([
            mock.call(self.node, '/dev/sdj'),
            mock.call(self.node, '/dev/hdaa'),
        ])

    @mock.patch.object(hardware, 'safety_check_block_device', autospec=True)
    @mock.patch.object(hardware, 'ThreadPool', autospec=True)
-   def test_erase_devices_concurrency_pool_size(self, mocked_pool):
+   def test_erase_devices_concurrency_pool_size(self, mocked_pool,
+                                                mock_safety_check):
        self.hardware.list_block_devices = mock.Mock()
        self.hardware.list_block_devices.return_value = [
            hardware.BlockDevice('/dev/sdj', 'big', 1073741824, True),
@@ -1782,6 +1823,10 @@ class TestGenericHardwareManager(base.IronicAgentTest):

        self.hardware.erase_devices(self.node, [])
        mocked_pool.assert_called_with(1)
        mock_safety_check.assert_has_calls([
            mock.call(self.node, '/dev/sdj'),
            mock.call(self.node, '/dev/hdaa'),
        ])

    @mock.patch.object(hardware, 'dispatch_to_managers', autospec=True)
    def test_erase_devices_without_disk(self, mocked_dispatch):
@@ -2441,6 +2486,7 @@ class TestGenericHardwareManager(base.IronicAgentTest):
            mock.call('/sys/fs/pstore/' + arg) for arg in pstore_entries
        ])

    @mock.patch.object(hardware, 'safety_check_block_device', autospec=True)
    @mock.patch.object(il_utils, 'execute', autospec=True)
    @mock.patch.object(disk_utils, 'destroy_disk_metadata', autospec=True)
    @mock.patch.object(hardware.GenericHardwareManager,
@@ -2449,7 +2495,7 @@ class TestGenericHardwareManager(base.IronicAgentTest):
                       '_list_erasable_devices', autospec=True)
    def test_erase_devices_express(
            self, mock_list_erasable_devices, mock_nvme_erase,
-           mock_destroy_disk_metadata, mock_execute):
+           mock_destroy_disk_metadata, mock_execute, mock_safety_check):
        block_devices = [
            hardware.BlockDevice('/dev/sda', 'sata', 65535, False),
            hardware.BlockDevice('/dev/md0', 'raid-device', 32767, False),
@@ -2466,7 +2512,44 @@ class TestGenericHardwareManager(base.IronicAgentTest):
                          mock.call('/dev/md0', self.node['uuid'])],
                         mock_destroy_disk_metadata.call_args_list)
        mock_list_erasable_devices.assert_called_with(self.hardware)
        mock_safety_check.assert_has_calls([
            mock.call(self.node, '/dev/sda'),
            mock.call(self.node, '/dev/md0'),
            mock.call(self.node, '/dev/nvme0n1'),
            mock.call(self.node, '/dev/nvme1n1')
        ])

    @mock.patch.object(hardware, 'safety_check_block_device', autospec=True)
    @mock.patch.object(il_utils, 'execute', autospec=True)
    @mock.patch.object(disk_utils, 'destroy_disk_metadata', autospec=True)
    @mock.patch.object(hardware.GenericHardwareManager,
                       '_nvme_erase', autospec=True)
    @mock.patch.object(hardware.GenericHardwareManager,
                       '_list_erasable_devices', autospec=True)
    def test_erase_devices_express_stops_on_safety_failure(
            self, mock_list_erasable_devices, mock_nvme_erase,
            mock_destroy_disk_metadata, mock_execute, mock_safety_check):
        mock_safety_check.side_effect = errors.ProtectedDeviceError(
            device='foo',
            what='bar')
        block_devices = [
            hardware.BlockDevice('/dev/sda', 'sata', 65535, False),
            hardware.BlockDevice('/dev/md0', 'raid-device', 32767, False),
            hardware.BlockDevice('/dev/nvme0n1', 'nvme', 32767, False),
            hardware.BlockDevice('/dev/nvme1n1', 'nvme', 32767, False)
        ]
        mock_list_erasable_devices.return_value = list(block_devices)

        self.assertRaises(errors.ProtectedDeviceError,
                          self.hardware.erase_devices_express, self.node, [])
        mock_nvme_erase.assert_not_called()
        mock_destroy_disk_metadata.assert_not_called()
        mock_list_erasable_devices.assert_called_with(self.hardware)
        mock_safety_check.assert_has_calls([
            mock.call(self.node, '/dev/sda')
        ])

    @mock.patch.object(hardware, 'safety_check_block_device', autospec=True)
    @mock.patch.object(il_utils, 'execute', autospec=True)
    @mock.patch.object(hardware.GenericHardwareManager,
                       '_is_virtual_media_device', autospec=True)
@@ -2475,7 +2558,7 @@ class TestGenericHardwareManager(base.IronicAgentTest):
    @mock.patch.object(disk_utils, 'destroy_disk_metadata', autospec=True)
    def test_erase_devices_metadata(
            self, mock_metadata, mock_list_devs, mock__is_vmedia,
-           mock_execute):
+           mock_execute, mock_safety_check):
        block_devices = [
            hardware.BlockDevice('/dev/sr0', 'vmedia', 12345, True),
            hardware.BlockDevice('/dev/sdb2', 'raid-member', 32767, False),
@@ -2509,7 +2592,62 @@ class TestGenericHardwareManager(base.IronicAgentTest):
                          mock.call(self.hardware, block_devices[2]),
                          mock.call(self.hardware, block_devices[5])],
                         mock__is_vmedia.call_args_list)
        mock_safety_check.assert_has_calls([
            mock.call(self.node, '/dev/sda1'),
            mock.call(self.node, '/dev/sda'),
            # This is kind of redundant code/pattern behavior-wise,
            # but you never know what someone has done...
            mock.call(self.node, '/dev/md0')
        ])

    @mock.patch.object(hardware, 'safety_check_block_device', autospec=True)
    @mock.patch.object(il_utils, 'execute', autospec=True)
    @mock.patch.object(hardware.GenericHardwareManager,
                       '_is_virtual_media_device', autospec=True)
    @mock.patch.object(hardware.GenericHardwareManager,
                       'list_block_devices', autospec=True)
    @mock.patch.object(disk_utils, 'destroy_disk_metadata', autospec=True)
    def test_erase_devices_metadata_safety_check(
            self, mock_metadata, mock_list_devs, mock__is_vmedia,
            mock_execute, mock_safety_check):
        block_devices = [
            hardware.BlockDevice('/dev/sr0', 'vmedia', 12345, True),
            hardware.BlockDevice('/dev/sdb2', 'raid-member', 32767, False),
            hardware.BlockDevice('/dev/sda', 'small', 65535, False),
            hardware.BlockDevice('/dev/sda1', '', 32767, False),
            hardware.BlockDevice('/dev/sda2', 'raid-member', 32767, False),
            hardware.BlockDevice('/dev/md0', 'raid-device', 32767, False)
        ]
        # NOTE(coreywright): Don't return the list, but a copy of it, because
        # we depend on its elements' order when referencing it later during
        # verification, but the method under test sorts the list changing it.
        mock_list_devs.return_value = list(block_devices)
        mock__is_vmedia.side_effect = lambda _, dev: dev.name == '/dev/sr0'
        mock_execute.side_effect = [
            ('sdb2 linux_raid_member host:1 f9978968', ''),
            ('sda2 linux_raid_member host:1 f9978969', ''),
            ('sda1', ''), ('sda', ''), ('md0', '')]
        mock_safety_check.side_effect = [
            None,
            errors.ProtectedDeviceError(
                device='foo',
                what='bar')
        ]

        self.assertRaises(errors.ProtectedDeviceError,
                          self.hardware.erase_devices_metadata,
                          self.node, [])

        self.assertEqual([mock.call('/dev/sda1', self.node['uuid'])],
                         mock_metadata.call_args_list)
        mock_list_devs.assert_called_with(self.hardware,
                                          include_partitions=True)
        mock_safety_check.assert_has_calls([
            mock.call(self.node, '/dev/sda1'),
            mock.call(self.node, '/dev/sda'),
        ])

    @mock.patch.object(hardware, 'safety_check_block_device', autospec=True)
    @mock.patch.object(hardware.GenericHardwareManager,
                       '_is_linux_raid_member', autospec=True)
    @mock.patch.object(hardware.GenericHardwareManager,
@@ -2519,7 +2657,7 @@ class TestGenericHardwareManager(base.IronicAgentTest):
    @mock.patch.object(disk_utils, 'destroy_disk_metadata', autospec=True)
    def test_erase_devices_metadata_error(
            self, mock_metadata, mock_list_devs, mock__is_vmedia,
-           mock__is_raid_member):
+           mock__is_raid_member, mock_safety_check):
        block_devices = [
            hardware.BlockDevice('/dev/sda', 'small', 65535, False),
            hardware.BlockDevice('/dev/sdb', 'big', 10737418240, True),
@@ -2553,6 +2691,10 @@ class TestGenericHardwareManager(base.IronicAgentTest):
        self.assertEqual([mock.call(self.hardware, block_devices[1]),
                          mock.call(self.hardware, block_devices[0])],
                         mock__is_vmedia.call_args_list)
        mock_safety_check.assert_has_calls([
            mock.call(self.node, '/dev/sdb'),
            mock.call(self.node, '/dev/sda')
        ])

    @mock.patch.object(il_utils, 'execute', autospec=True)
    def test__is_linux_raid_member(self, mocked_execute):
@@ -4991,3 +5133,51 @@ class TestAPIClientSaveAndUse(base.IronicAgentTest):
        calls = [mock.call('list_hardware_info'),
                 mock.call('wait_for_disks')]
        mock_dispatch.assert_has_calls(calls)


@mock.patch.object(il_utils, 'execute', autospec=True)
class TestProtectedDiskSafetyChecks(base.IronicAgentTest):

    def test_special_filesystem_guard_not_enabled(self, mock_execute):
        CONF.set_override('guard_special_filesystems', False)
        hardware.safety_check_block_device({}, '/dev/foo')
        mock_execute.assert_not_called()

    def test_special_filesystem_guard_node_indicates_skip(self, mock_execute):
        node = {
            'driver_internal_info': {
                'wipe_special_filesystems': False
            }
        }
        mock_execute.return_value = ('', '')
        hardware.safety_check_block_device(node, '/dev/foo')
        mock_execute.assert_not_called()

    def test_special_filesystem_guard_enabled_no_results(self, mock_execute):
        mock_execute.return_value = ('', '')
        hardware.safety_check_block_device({}, '/dev/foo')

    def test_special_filesystem_guard_raises(self, mock_execute):
        GFS2 = 'FSTYPE="gfs2"\n'
        GPFS1 = 'UUID="37AFFC90-EF7D-4E96-91C3-2D7AE055B174"\n'
        GPFS2 = 'PTUUID="37AFFC90-EF7D-4E96-91C3-2D7AE055B174"\n'
        GPFS3 = 'PARTTYPE="37AFFC90-EF7D-4E96-91C3-2D7AE055B174"\n'
        GPFS4 = 'PARTUUID="37AFFC90-EF7D-4E96-91C3-2D7AE055B174"\n'
        VMFS1 = 'UUID="AA31E02A-400F-11DB-9590-000C2911D1B8"\n'
        VMFS2 = 'PTUUID="AA31E02A-400F-11DB-9590-000C2911D1B8"\n'
        VMFS3 = 'PARTTYPE="AA31E02A-400F-11DB-9590-000C2911D1B8"\n'
        VMFS4 = 'PARTUUID="AA31E02A-400F-11DB-9590-000C2911D1B8"\n'
        VMFS5 = 'UUID="0xfb"\n'
        VMFS6 = 'PTUUID="0xfb"\n'
        VMFS7 = 'PARTTYPE="0xfb"\n'
        VMFS8 = 'PARTUUID="0xfb"\n'

        expected_failures = [GFS2, GPFS1, GPFS2, GPFS3, GPFS4, VMFS1, VMFS2,
                             VMFS3, VMFS4, VMFS5, VMFS6, VMFS7, VMFS8]
        for failure in expected_failures:
            mock_execute.reset_mock()
            mock_execute.return_value = (failure, '')
            self.assertRaises(errors.ProtectedDeviceError,
                              hardware.safety_check_block_device,
                              {}, '/dev/foo')
            self.assertEqual(1, mock_execute.call_count)
@@ -0,0 +1,20 @@
---
fixes:
  - |
    Previously, when the ``ironic-python-agent`` would undergo erasure of
    block devices during cleaning, it would automatically attempt to erase
    the contents of any "shared device" clustered filesystems which may be
    in use by multiple distinct machines over a storage fabric. Now
    IBM GPFS, Red Hat Global File System 2, and VMware Virtual Machine File
    System (VMFS) are identified, and cleaning is halted. This is important
    because, should an active cluster be using the disk, cleaning could
    potentially cause the cluster to go down, forcing restoration from
    backups. Ideally, infrastructure operators should check their
    environment's storage configuration and un-map any clustered filesystems
    from being visible to Ironic nodes, unless explicitly needed and
    expected. Please see the Ironic-Python-Agent troubleshooting
    documentation for more information.
issues:
  - |
    Logic to guard VMFS filesystems from being destroyed *may* not recognize
    VMFS extents. Operators with examples of partitioning for extent usage
    are encouraged to contact the Ironic community.