Concurrent destructive/intensive ops limits

Provide the ability to limit resource-intensive or potentially wide-scale
operations which could be a symptom of a highly destructive and unplanned
operation in progress. The idea behind this change is to help guard the
overall deployment against resource exhaustion, or to prevent an attacker
with valid credentials from putting an entire deployment into a potentially
disastrous cleaning situation, since ironic otherwise only limits
concurrency based upon running tasks per conductor.

Story: 2010007
Task: 45140
Change-Id: I642452cd480e7674ff720b65ca32bce59a4a834a
This commit is contained in:
parent
b9af8e4ef3
commit
9a8b1d149c
@@ -973,3 +973,40 @@ Unfortunately, due to the way the conductor is designed, it is not possible to
gracefully break a stuck lock held in ``*-ing`` states. As the last resort, you
may need to restart the affected conductor. See `Why are my nodes stuck in a
"-ing" state?`_.

What is ConcurrentActionLimit?
==============================

ConcurrentActionLimit is an exception which is raised to clients when an
operation is requested, but cannot be serviced at that moment because the
overall threshold of nodes in concurrent "Deployment" or "Cleaning"
operations has been reached.

These limits exist for two distinct reasons.

The first is that they allow an operator to tune a deployment so that too
many concurrent deployments cannot be triggered at any given time. A single
conductor has an internal limit on the number of overall concurrent tasks,
but that restricts only the number of *running* actions. This limit, by
contrast, accounts for the number of nodes in the ``deploy`` and
``deploy wait`` states. For deployments, the default value is relatively
high and should be suitable for *most* larger operators.

The second is to help slow down the rate at which an entire population of
baremetal nodes can be moved into and through cleaning, in order to help
guard against authenticated malicious users or accidental script-driven
operations. In this case, the total number of nodes in the ``deleting``,
``cleaning``, and ``clean wait`` states is evaluated. The default maximum
limit for cleaning operations is *50* and should be suitable for the
majority of baremetal operators.

These settings can be modified using the
``[conductor]max_concurrent_deploy`` and ``[conductor]max_concurrent_clean``
settings in the ironic.conf file supporting the ``ironic-conductor``
service. Neither setting can be explicitly disabled; however, there is also
no upper limit to either setting.

.. note::
   This was an infrastructure operator requested feature from actual lessons
   learned in the operation of Ironic in large scale production. The defaults
   may not be suitable for the largest scale operators.
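For example, the limits can be tuned in ``ironic.conf`` (the values below are
illustrative, not recommendations):

```ini
[conductor]
# Reject new deploy requests once 100 nodes are in the
# "deploy"/"deploy wait" states.
max_concurrent_deploy = 100
# Reject new undeploy/clean requests once 25 nodes are in the
# "deleting"/"cleaning"/"clean wait" states.
max_concurrent_clean = 25
```

Both options are mutable, so a running conductor can pick them up on a
configuration reload without a restart.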
@@ -851,3 +851,13 @@ class ImageRefIsARedirect(IronicException):
            message=msg,
            image_ref=image_ref,
            redirect_url=redirect_url)


class ConcurrentActionLimit(IronicException):
    # NOTE(TheJulia): We explicitly don't report the concurrent
    # action limit configuration value as a security guard since
    # if informed of the limit, an attacker can tailor their attack.
    _msg_fmt = _("Unable to process request at this time. "
                 "The concurrent action limit for %(task_type)s "
                 "has been reached. Please contact your administrator "
                 "and try again later.")
@@ -886,7 +886,8 @@ class ConductorManager(base_manager.BaseConductorManager):
                                   exception.NodeInMaintenance,
                                   exception.InstanceDeployFailure,
                                   exception.InvalidStateRequested,
                                   exception.NodeProtected)
                                   exception.NodeProtected,
                                   exception.ConcurrentActionLimit)
    def do_node_deploy(self, context, node_id, rebuild=False,
                       configdrive=None, deploy_steps=None):
        """RPC method to initiate deployment to a node.
@@ -910,8 +911,11 @@ class ConductorManager(base_manager.BaseConductorManager):
        :raises: InvalidStateRequested when the requested state is not a valid
                 target from the current state.
        :raises: NodeProtected if the node is protected.
        :raises: ConcurrentActionLimit if this action would exceed the maximum
                 number of configured concurrent actions of this type.
        """
        LOG.debug("RPC do_node_deploy called for node %s.", node_id)
        self._concurrent_action_limit(action='provisioning')
        event = 'rebuild' if rebuild else 'deploy'

        # NOTE(comstud): If the _sync_power_states() periodic task happens
@@ -983,7 +987,8 @@ class ConductorManager(base_manager.BaseConductorManager):
                                   exception.NodeLocked,
                                   exception.InstanceDeployFailure,
                                   exception.InvalidStateRequested,
                                   exception.NodeProtected)
                                   exception.NodeProtected,
                                   exception.ConcurrentActionLimit)
    def do_node_tear_down(self, context, node_id):
        """RPC method to tear down an existing node deployment.

@@ -998,8 +1003,11 @@ class ConductorManager(base_manager.BaseConductorManager):
        :raises: InvalidStateRequested when the requested state is not a valid
                 target from the current state.
        :raises: NodeProtected if the node is protected.
        :raises: ConcurrentActionLimit if this action would exceed the maximum
                 number of configured concurrent actions of this type.
        """
        LOG.debug("RPC do_node_tear_down called for node %s.", node_id)
        self._concurrent_action_limit(action='unprovisioning')

        with task_manager.acquire(context, node_id, shared=False,
                                  purpose='node tear down') as task:
@@ -1121,7 +1129,8 @@ class ConductorManager(base_manager.BaseConductorManager):
                                   exception.InvalidStateRequested,
                                   exception.NodeInMaintenance,
                                   exception.NodeLocked,
                                   exception.NoFreeConductorWorker)
                                   exception.NoFreeConductorWorker,
                                   exception.ConcurrentActionLimit)
    def do_node_clean(self, context, node_id, clean_steps,
                      disable_ramdisk=False):
        """RPC method to initiate manual cleaning.
@@ -1150,7 +1159,10 @@ class ConductorManager(base_manager.BaseConductorManager):
        :raises: NodeLocked if node is locked by another conductor.
        :raises: NoFreeConductorWorker when there is no free worker to start
                 async task.
        :raises: ConcurrentActionLimit if this action would exceed the
                 configured limits of the deployment.
        """
        self._concurrent_action_limit(action='cleaning')
        with task_manager.acquire(context, node_id, shared=False,
                                  purpose='node manual cleaning') as task:
            node = task.node
@@ -3549,6 +3561,40 @@ class ConductorManager(base_manager.BaseConductorManager):
            # impact DB access if done in excess.
            eventlet.sleep(0)

    def _concurrent_action_limit(self, action):
        """Check concurrency limits and block operations if needed.

        This method is used to serve as a central place for the logic
        for checks on concurrency limits. If a limit is reached, then
        an appropriate exception is raised.

        :raises: ConcurrentActionLimit if the configured limit
                 is exceeded.
        """
        # NOTE(TheJulia): Keeping this all in one place for simplicity.
        if action == 'provisioning':
            node_count = self.dbapi.count_nodes_in_provision_state([
                states.DEPLOYING,
                states.DEPLOYWAIT
            ])
            if node_count >= CONF.conductor.max_concurrent_deploy:
                raise exception.ConcurrentActionLimit(
                    task_type=action)

        if action == 'unprovisioning' or action == 'cleaning':
            # NOTE(TheJulia): This also checks for the deleting state
            # which is super transitory, *but* you can get a node into
            # the state. So in order to guard against a DoS attack, we
            # need to check even the super transitory node state.
            node_count = self.dbapi.count_nodes_in_provision_state([
                states.DELETING,
                states.CLEANING,
                states.CLEANWAIT
            ])
            if node_count >= CONF.conductor.max_concurrent_clean:
                raise exception.ConcurrentActionLimit(
                    task_type=action)


@METRICS.timer('get_vendor_passthru_metadata')
def get_vendor_passthru_metadata(route_dict):
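Stripped of ironic's DB and configuration plumbing, the gate above reduces to
"count the nodes in the relevant states and refuse if at or over the limit".
A minimal stand-alone sketch of that idea (the in-memory node list, state
strings, and limits here are illustrative stand-ins, not ironic's API or
defaults):

```python
class ConcurrentActionLimit(Exception):
    """Raised when the concurrency threshold for an action is reached."""


# Per action type: which provision states are counted, and the maximum
# allowed. Values here are illustrative, not ironic's defaults.
LIMITS = {
    'provisioning': ({'deploying', 'deploy wait'}, 2),
    'cleaning': ({'deleting', 'cleaning', 'clean wait'}, 2),
    'unprovisioning': ({'deleting', 'cleaning', 'clean wait'}, 2),
}


def check_concurrent_action_limit(action, provision_states):
    """Raise ConcurrentActionLimit if too many nodes are mid-operation."""
    counted, maximum = LIMITS[action]
    node_count = sum(1 for s in provision_states if s in counted)
    # Note the >= comparison: the request that would reach the configured
    # limit is the first one rejected, mirroring the conductor check.
    if node_count >= maximum:
        raise ConcurrentActionLimit(
            'concurrent action limit for %s reached' % action)
```

With two nodes already in ``deploying``/``deploy wait`` and a limit of 2, a
further deployment request is refused while cleaning requests still pass.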
@@ -358,6 +358,32 @@ opts = [
               'model. The conductor does *not* record this value '
               'otherwise, and this information is not backfilled '
               'for prior instances which have been deployed.')),
    cfg.IntOpt('max_concurrent_deploy',
               default=250,
               min=1,
               mutable=True,
               help=_('The maximum number of concurrent nodes in deployment '
                      'which are permitted in this Ironic system. '
                      'If this limit is reached, new requests will be '
                      'rejected until the number of deployments in progress '
                      'is lower than this maximum. As this is a security '
                      'mechanism, requests are not queued, and this setting '
                      'is a global setting applying to all requests this '
                      'conductor receives, regardless of access rights. '
                      'The concurrent deployment limit cannot be disabled.')),
    cfg.IntOpt('max_concurrent_clean',
               default=50,
               min=1,
               mutable=True,
               help=_('The maximum number of concurrent nodes in cleaning '
                      'which are permitted in this Ironic system. '
                      'If this limit is reached, new requests will be '
                      'rejected until the number of nodes in cleaning '
                      'is lower than this maximum. As this is a security '
                      'mechanism, requests are not queued, and this setting '
                      'is a global setting applying to all requests this '
                      'conductor receives, regardless of access rights. '
                      'The concurrent clean limit cannot be disabled.')),
]
@@ -1416,3 +1416,12 @@ class Connection(object, metaclass=abc.ABCMeta):
        :param entries: A list of node history entry ids to be
            queried for deletion.
        """

    @abc.abstractmethod
    def count_nodes_in_provision_state(self, state):
        """Count the number of nodes in given provision state.

        :param state: A provision_state value to match for the
            count operation. This can be a single provision
            state value or a list of values.
        """
@@ -30,6 +30,7 @@ from oslo_utils import timeutils
from oslo_utils import uuidutils
from osprofiler import sqlalchemy as osp_sqlalchemy
import sqlalchemy as sa
from sqlalchemy import or_
from sqlalchemy.orm.exc import NoResultFound, MultipleResultsFound
from sqlalchemy.orm import joinedload
from sqlalchemy.orm import Load
@@ -2400,3 +2401,26 @@ class Connection(api.Connection):
        ).filter(
            models.NodeHistory.id.in_(entries)
        ).delete(synchronize_session=False)

    def count_nodes_in_provision_state(self, state):
        if not isinstance(state, list):
            state = [state]
        with _session_for_read() as session:
            # Intentionally does not use the full ORM model
            # because that is de-duped by pkey, but we already
            # have unique constraints on UUID/name, so... shouldn't
            # be a big deal. #JuliaFamousLastWords.
            # Anyway, intent here is to be as quick as possible and
            # literally have the DB do *all* of the work, so no
            # client side ops occur. The column is also indexed,
            # which means this will be an index based response.
            # TODO(TheJulia): This might need to be revised for
            # SQLAlchemy 2.0 as it should be a scalar select and count
            # instead.
            return session.query(
                models.Node.provision_state
            ).filter(
                or_(models.Node.provision_state == v for v in state)
            ).count()
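The query above amounts to a single indexed ``COUNT`` with an ``OR`` of
equality tests (equivalently, an ``IN`` list) over the ``provision_state``
column, returning one scalar with no client-side work. A rough stand-alone
illustration using only the standard library's ``sqlite3`` (the table name,
schema, and data are invented for the example):

```python
import sqlite3

# In-memory stand-in for the nodes table; schema invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (uuid TEXT, provision_state TEXT)")
# The real column is indexed, so the count is an index-only operation.
conn.execute("CREATE INDEX ix_state ON nodes (provision_state)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?)",
    [("n1", "deploying"), ("n2", "deploy wait"),
     ("n3", "active"), ("n4", "cleaning")])


def count_nodes_in_provision_state(states):
    # The OR of equality tests collapses to an IN list; the database does
    # all of the work and hands back a single scalar.
    placeholders = ", ".join("?" for _ in states)
    row = conn.execute(
        "SELECT COUNT(*) FROM nodes WHERE provision_state IN (%s)"
        % placeholders, states).fetchone()
    return row[0]


print(count_nodes_in_provision_state(["deploying", "deploy wait"]))  # prints 2
```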
@@ -1829,6 +1829,7 @@ class ServiceDoNodeDeployTestCase(mgr_utils.ServiceSetUpMixin,

    def test_do_node_deploy_maintenance(self, mock_iwdi):
        mock_iwdi.return_value = False
        self._start_service()
        node = obj_utils.create_test_node(self.context, driver='fake-hardware',
                                          maintenance=True)
        exc = self.assertRaises(messaging.rpc.ExpectedException,
@@ -1843,6 +1844,7 @@ class ServiceDoNodeDeployTestCase(mgr_utils.ServiceSetUpMixin,
        self.assertFalse(mock_iwdi.called)

    def _test_do_node_deploy_validate_fail(self, mock_validate, mock_iwdi):
        self._start_service()
        mock_iwdi.return_value = False
        # InvalidParameterValue should be re-raised as InstanceDeployFailure
        mock_validate.side_effect = exception.InvalidParameterValue('error')
@@ -2389,6 +2391,7 @@ class DoNodeTearDownTestCase(mgr_utils.ServiceSetUpMixin, db_base.DbTestCase):
    @mock.patch('ironic.drivers.modules.fake.FakePower.validate',
                autospec=True)
    def test_do_node_tear_down_validate_fail(self, mock_validate):
        self._start_service()
        # InvalidParameterValue should be re-raised as InstanceDeployFailure
        mock_validate.side_effect = exception.InvalidParameterValue('error')
        node = obj_utils.create_test_node(
@@ -8374,7 +8377,6 @@ class NodeHistoryRecordCleanupTestCase(mgr_utils.ServiceSetUpMixin,
        # 9 retained due to days, 3 to config
        self.service._manage_node_history(self.context)
        events = objects.NodeHistory.list(self.context)
        print(events)
        self.assertEqual(12, len(events))
        events = objects.NodeHistory.list_by_node_id(self.context, 10)
        self.assertEqual(4, len(events))
@@ -8394,3 +8396,73 @@ class NodeHistoryRecordCleanupTestCase(mgr_utils.ServiceSetUpMixin,
        self.assertEqual('one', events[1].event)
        self.assertEqual('two', events[2].event)
        self.assertEqual('three', events[3].event)


class ConcurrentActionLimitTestCase(mgr_utils.ServiceSetUpMixin,
                                    db_base.DbTestCase):

    def setUp(self):
        super(ConcurrentActionLimitTestCase, self).setUp()
        self._start_service()
        self.node1 = obj_utils.get_test_node(
            self.context,
            driver='fake-hardware',
            id=110,
            uuid=uuidutils.generate_uuid())
        self.node2 = obj_utils.get_test_node(
            self.context,
            driver='fake-hardware',
            id=111,
            uuid=uuidutils.generate_uuid())
        self.node3 = obj_utils.get_test_node(
            self.context,
            driver='fake-hardware',
            id=112,
            uuid=uuidutils.generate_uuid())
        self.node4 = obj_utils.get_test_node(
            self.context,
            driver='fake-hardware',
            id=113,
            uuid=uuidutils.generate_uuid())
        # Create the nodes, as the tasks need to operate across tables.
        self.node1.create()
        self.node2.create()
        self.node3.create()
        self.node4.create()

    def test_concurrent_action_limit_deploy(self):
        self.node1.provision_state = states.DEPLOYING
        self.node2.provision_state = states.DEPLOYWAIT
        self.node1.save()
        self.node2.save()
        CONF.set_override('max_concurrent_deploy', 2, group='conductor')
        self.assertRaises(
            exception.ConcurrentActionLimit,
            self.service._concurrent_action_limit,
            'provisioning')
        self.service._concurrent_action_limit('unprovisioning')
        self.service._concurrent_action_limit('cleaning')
        CONF.set_override('max_concurrent_deploy', 3, group='conductor')
        self.service._concurrent_action_limit('provisioning')

    def test_concurrent_action_limit_cleaning(self):
        self.node1.provision_state = states.DELETING
        self.node2.provision_state = states.CLEANING
        self.node3.provision_state = states.CLEANWAIT
        self.node1.save()
        self.node2.save()
        self.node3.save()

        CONF.set_override('max_concurrent_clean', 3, group='conductor')
        self.assertRaises(
            exception.ConcurrentActionLimit,
            self.service._concurrent_action_limit,
            'cleaning')
        self.assertRaises(
            exception.ConcurrentActionLimit,
            self.service._concurrent_action_limit,
            'unprovisioning')
        self.service._concurrent_action_limit('provisioning')
        CONF.set_override('max_concurrent_clean', 4, group='conductor')
        self.service._concurrent_action_limit('cleaning')
        self.service._concurrent_action_limit('unprovisioning')
@@ -1081,3 +1081,39 @@ class DbNodeTestCase(base.DbTestCase):
                          self.dbapi.check_node_list,
                          [node1.uuid, 'this/cannot/be/a/name'])
        self.assertIn('this/cannot/be/a/name', str(exc))

    def test_node_provision_state_count(self):
        active_nodes = 5
        manageable_nodes = 3
        deploywait_nodes = 1
        for i in range(0, active_nodes):
            utils.create_test_node(uuid=uuidutils.generate_uuid(),
                                   provision_state=states.ACTIVE)
        for i in range(0, manageable_nodes):
            utils.create_test_node(uuid=uuidutils.generate_uuid(),
                                   provision_state=states.MANAGEABLE)
        for i in range(0, deploywait_nodes):
            utils.create_test_node(uuid=uuidutils.generate_uuid(),
                                   provision_state=states.DEPLOYWAIT)

        self.assertEqual(
            active_nodes,
            self.dbapi.count_nodes_in_provision_state(states.ACTIVE)
        )
        self.assertEqual(
            manageable_nodes,
            self.dbapi.count_nodes_in_provision_state(states.MANAGEABLE)
        )
        self.assertEqual(
            deploywait_nodes,
            self.dbapi.count_nodes_in_provision_state(states.DEPLOYWAIT)
        )
        total = active_nodes + manageable_nodes + deploywait_nodes
        self.assertEqual(
            total,
            self.dbapi.count_nodes_in_provision_state([
                states.ACTIVE,
                states.MANAGEABLE,
                states.DEPLOYWAIT
            ])
        )
@@ -0,0 +1,23 @@
---
features:
  - |
    Adds a concurrency limit for the number of nodes in states related to
    *Cleaning* and *Provisioning* operations across the ironic deployment.
    These settings default the maximum number of concurrent deployments to
    ``250`` and the maximum number of concurrent deletes and cleaning
    operations to ``50``. They can be tuned using
    ``[conductor]max_concurrent_deploy`` and
    ``[conductor]max_concurrent_clean``, respectively.
    The defaults should generally be good for most operators in most cases.
    Large scale operators should evaluate the defaults and tune
    appropriately, as this feature cannot be disabled since it is a security
    mechanism.
upgrade:
  - |
    Large scale operators should be aware that a new feature, referred to as
    "Concurrent Action Limit", was introduced as a security mechanism to
    provide a means to limit attackers, or faulty scripts, from potentially
    causing irreparable harm to an environment. This feature cannot be
    disabled, and operators are encouraged to tune the new settings
    ``[conductor]max_concurrent_deploy`` and
    ``[conductor]max_concurrent_clean`` to match the needs of their
    environment.