
nova-manage: heal port allocations

Before I97f06d0ec34cbd75c182caaa686b8de5c777a576 it was possible to
create servers with neutron ports that had a resource_request (e.g. a
port with a QoS minimum bandwidth policy rule) without allocating the
requested resources in placement. So there can be servers for which
the allocations need to be healed in placement.

This patch extends the nova-manage heal_allocations CLI to create the
missing port allocations in placement and update each such port in
neutron with the resource provider uuid that is used for the
allocation.
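
For illustration, here is a minimal sketch of the port update sent to
neutron during healing; the provider UUID and port id are made-up
placeholders, not values from this change:

    # Sketch only: how the chosen resource provider is recorded on the
    # port. Existing binding:profile keys are kept; only the
    # 'allocation' key is added. The UUID is a placeholder.
    rp_uuid = 'b2d1e744-7a2b-4c42-8c2c-1234567890ab'
    port = {'id': 'some-port-uuid', 'binding:profile': {}}

    profile = port.get('binding:profile', {}) or {}
    profile['allocation'] = rp_uuid
    body = {'port': {'binding:profile': profile}}
    # a neutron client would then be called roughly as:
    # neutron.update_port(port['id'], body=body)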

There are known limitations of this patch. It does not try to
reimplement Placement's allocation candidate functionality, so it
cannot handle the situation when there is more than one RP in the
compute tree that provides the required traits for a port. Deciding
which RP to use in that situation would require 1) the in_tree
allocation candidate support from placement, which is not available
yet, and 2) information about which PCI PF the VF of an SR-IOV port is
allocated from and which RP represents that PCI device in placement.
This information is only available on the compute hosts.
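
A rough sketch of the selection step, with made-up provider and trait
data (not the exact implementation), shows why the command can only
proceed when exactly one provider in the compute tree offers all the
required traits:

    # Sketch only: trait-based matching with invented data.
    providers = {
        'rp-ovs-agent': {'CUSTOM_PHYSNET2', 'CUSTOM_VNIC_TYPE_NORMAL'},
        'rp-sriov-pf1': {'CUSTOM_PHYSNET2', 'CUSTOM_VNIC_TYPE_DIRECT'},
        'rp-sriov-pf2': {'CUSTOM_PHYSNET2', 'CUSTOM_VNIC_TYPE_DIRECT'},
    }
    required = {'CUSTOM_PHYSNET2', 'CUSTOM_VNIC_TYPE_DIRECT'}
    matching = [rp for rp, traits in providers.items()
                if required.issubset(traits)]
    if len(matching) > 1:
        # ambiguous; deciding would need allocation candidate support
        # and per-host PCI information
        print('more than one matching provider: %s' % ', '.join(matching))
    elif not matching:
        # probably a neutron QoS configuration problem
        print('no provider with traits %s' % sorted(required))
    else:
        print('healing from provider %s' % matching[0])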

For the unsupported cases the command fails gracefully. As soon as
migration support for such servers is implemented in the blueprint
support-move-ops-with-qos-ports, the admin can heal the allocations of
such servers by migrating them.

During healing both placement and neutron need to be updated. If any
of those updates fails, the code tries to roll back the previous
updates for the instance so that the healing can be re-run later
without issue. However, if the rollback itself fails, the script
terminates with an error message pointing to documentation that
describes how to recover manually from such a partially healed
situation.
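
The rollback is simply the inverse of the port update above; a minimal
sketch, reusing the hypothetical port shape from the earlier example:

    # Sketch only: drop the 'allocation' key that healing added, then
    # send the profile back to neutron.
    port = {
        'id': 'some-port-uuid',
        'binding:profile': {
            'allocation': 'b2d1e744-7a2b-4c42-8c2c-1234567890ab'},
    }
    profile = port['binding:profile']
    profile.pop('allocation')
    body = {'port': {'binding:profile': profile}}
    # neutron.update_port(port['id'], body=body)
    #
    # If this automatic rollback also fails, the same cleanup can be
    # done manually, e.g. openstack port unset <port_uuid>
    # --binding-profile allocation, as described in the nova-manage
    # documentation.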

Closes-Bug: #1819923
Change-Id: I4b2b1688822eb2f0174df0c8c6c16d554781af85
tags/20.0.0.0rc1
Balazs Gibizer 7 months ago
parent
commit
54dea2531c

+ 40
- 1
doc/source/cli/nova-manage.rst

@@ -367,13 +367,37 @@ Nova Cells v2
367 367
 Placement
368 368
 ~~~~~~~~~
369 369
 
370
-``nova-manage placement heal_allocations [--max-count <max_count>] [--verbose] [--dry-run] [--instance <instance_uuid>]``
370
+``nova-manage placement heal_allocations [--max-count <max_count>] [--verbose] [--skip-port-allocations] [--dry-run] [--instance <instance_uuid>]``
371 371
     Iterates over non-cell0 cells looking for instances which do not have
372 372
     allocations in the Placement service and which are not undergoing a task
373 373
     state transition. For each instance found, allocations are created against
374 374
     the compute node resource provider for that instance based on the flavor
375 375
     associated with the instance.
376 376
 
377
+    Also if the instance has any port attached that has a resource request
378
+    (e.g. :neutron-doc:`Quality of Service (QoS): Guaranteed Bandwidth
379
+    <admin/config-qos-min-bw.html>`) but the corresponding
380
+    allocation is not found, then the allocation is created against the
381
+    network device resource providers according to the resource request of
382
+    that port. It is possible that the missing allocation cannot be created
383
+    either due to not having enough resource inventory on the host the instance
384
+    resides on or because more than one resource provider could fulfill the
385
+    request. In this case the instance needs to be manually deleted or the
386
+    port needs to be detached.  When nova `supports migrating instances
387
+    with guaranteed bandwidth ports`_, migration will heal missing allocations
388
+    for these instances.
389
+
390
+    Before the allocations for the ports are persisted in placement nova-manage
391
+    tries to update each port in neutron to refer to the resource provider UUID
392
+    which provides the requested resources. If any of the port updates fail in
393
+    neutron or the allocation update fails in placement the command tries to
394
+    roll back the partial updates to the ports. If the roll back fails
395
+    then the process stops with exit code ``7`` and the admin needs to do the
396
+    rollback in neutron manually according to the description in the exit code
397
+    section.
398
+
399
+    .. _supports migrating instances with guaranteed bandwidth ports: https://specs.openstack.org/openstack/nova-specs/specs/train/approved/support-move-ops-with-qos-ports.html
400
+
377 401
     There is also a special case handled for instances that *do* have
378 402
     allocations created before Placement API microversion 1.8 where project_id
379 403
     and user_id values were required. For those types of allocations, the
@@ -393,6 +417,13 @@ Placement
393 417
     specified the ``--max-count`` option has no effect.
394 418
     *(Since 20.0.0 Train)*
395 419
 
420
+    Specify ``--skip-port-allocations`` to skip the healing of the resource
421
+    allocations of bound ports, e.g. healing bandwidth resource allocation for
422
+    ports having minimum QoS policy rules attached. If your deployment does
423
+    not use such a feature then the performance impact of querying neutron
424
+    ports for each instance can be avoided with this flag.
425
+    *(Since 20.0.0 Train)*
426
+
396 427
     This command requires that the ``[api_database]/connection`` and
397 428
     ``[placement]`` configuration options are set. Placement API >= 1.28 is
398 429
     required.
@@ -405,6 +436,14 @@ Placement
405 436
     * 3: Unable to create (or update) allocations for an instance against its
406 437
       compute node resource provider.
407 438
     * 4: Command completed successfully but no allocations were created.
439
+    * 5: Unable to query ports from neutron
440
+    * 6: Unable to update ports in neutron
441
+    * 7: Cannot roll back neutron port updates. Manual steps needed. The error
442
+      message will indicate which neutron ports need to be changed to clean up
443
+      ``binding:profile`` of the port::
444
+
445
+        $ openstack port unset <port_uuid> --binding-profile allocation
446
+
408 447
     * 127: Invalid input.
409 448
 
410 449
 ``nova-manage placement sync_aggregates [--verbose]``

+ 401
- 10
nova/cmd/manage.py

@@ -23,6 +23,7 @@
23 23
 
24 24
 from __future__ import print_function
25 25
 
26
+import collections
26 27
 import functools
27 28
 import re
28 29
 import sys
@@ -32,6 +33,7 @@ from dateutil import parser as dateutil_parser
32 33
 import decorator
33 34
 from keystoneauth1 import exceptions as ks_exc
34 35
 import netaddr
36
+from neutronclient.common import exceptions as neutron_client_exc
35 37
 from oslo_config import cfg
36 38
 from oslo_db import exception as db_exc
37 39
 from oslo_log import log as logging
@@ -54,6 +56,7 @@ from nova.db import migration
54 56
 from nova.db.sqlalchemy import api as sa_db
55 57
 from nova import exception
56 58
 from nova.i18n import _
59
+from nova.network.neutronv2 import api as neutron_api
57 60
 from nova import objects
58 61
 from nova.objects import block_device as block_device_obj
59 62
 from nova.objects import build_request as build_request_obj
@@ -1658,6 +1661,290 @@ class PlacementCommands(object):
1658 1661
         node_cache[instance.node] = node_uuid
1659 1662
         return node_uuid
1660 1663
 
1664
+    @staticmethod
1665
+    def _get_ports(ctxt, instance, neutron):
1666
+        """Return the ports that are bound to the instance
1667
+
1668
+        :param ctxt: nova.context.RequestContext
1669
+        :param instance: the instance to return the ports for
1670
+        :param neutron: nova.network.neutronv2.api.ClientWrapper to
1671
+            communicate with Neutron
1672
+        :return: a list of neutron port dict objects
1673
+        :raise UnableToQueryPorts: If the neutron list ports query fails.
1674
+        """
1675
+        try:
1676
+            return neutron.list_ports(
1677
+                ctxt, device_id=instance.uuid,
1678
+                fields=['id', 'resource_request', 'binding:profile'])['ports']
1679
+        except neutron_client_exc.NeutronClientException as e:
1680
+            raise exception.UnableToQueryPorts(
1681
+                instance_uuid=instance.uuid, error=six.text_type(e))
1682
+
1683
+    @staticmethod
1684
+    def _has_request_but_no_allocation(port):
1685
+        request = port.get('resource_request')
1686
+        binding_profile = port.get('binding:profile', {}) or {}
1687
+        allocation = binding_profile.get('allocation')
1688
+        # We are defensive here about 'resources' and 'required' in the
1689
+        # 'resource_request' as neutron API is not clear about those fields
1690
+        # being optional.
1691
+        return (request and request.get('resources') and
1692
+                request.get('required') and
1693
+                not allocation)
1694
+
1695
+    @staticmethod
1696
+    def _get_rps_in_tree_with_required_traits(
1697
+            ctxt, rp_uuid, required_traits, placement):
1698
+        """Find the RPs that have all the required traits in the given rp tree.
1699
+
1700
+        :param ctxt: nova.context.RequestContext
1701
+        :param rp_uuid: the RP uuid that will be used to query the tree.
1702
+        :param required_traits: the traits that need to be supported by
1703
+            the returned resource providers.
1704
+        :param placement: nova.scheduler.client.report.SchedulerReportClient
1705
+            to communicate with the Placement service API.
1706
+        :raise PlacementAPIConnectFailure: if placement API cannot be reached
1707
+        :raise ResourceProviderRetrievalFailed: if the resource provider does
1708
+            not exist.
1709
+        :raise ResourceProviderTraitRetrievalFailed: if resource provider
1710
+            trait information cannot be read from placement.
1711
+        :return: A list of RP UUIDs that supports every required traits and
1712
+            in the tree for the provider rp_uuid.
1713
+        """
1714
+        try:
1715
+            rps = placement.get_providers_in_tree(ctxt, rp_uuid)
1716
+            matching_rps = [
1717
+                rp['uuid']
1718
+                for rp in rps
1719
+                if set(required_traits).issubset(
1720
+                    placement.get_provider_traits(ctxt, rp['uuid']).traits)
1721
+            ]
1722
+        except ks_exc.ClientException:
1723
+            raise exception.PlacementAPIConnectFailure()
1724
+
1725
+        return matching_rps
1726
+
1727
+    @staticmethod
1728
+    def _merge_allocations(alloc1, alloc2):
1729
+        """Return a new allocation dict that contains the sum of alloc1 and
1730
+        alloc2.
1731
+
1732
+        :param alloc1: a dict in the form of
1733
+            {
1734
+                <rp_uuid>: {'resources': {<resource class>: amount,
1735
+                                          <resource class>: amount},
1736
+                <rp_uuid>: {'resources': {<resource class>: amount},
1737
+            }
1738
+        :param alloc2: a dict in the same form as alloc1
1739
+        :return: the merged allocation of alloc1 and alloc2 in the same format
1740
+        """
1741
+
1742
+        allocations = collections.defaultdict(
1743
+            lambda: {'resources': collections.defaultdict(int)})
1744
+
1745
+        for alloc in [alloc1, alloc2]:
1746
+            for rp_uuid in alloc:
1747
+                for rc, amount in alloc[rp_uuid]['resources'].items():
1748
+                    allocations[rp_uuid]['resources'][rc] += amount
1749
+        return allocations
1750
+
1751
+    def _get_port_allocation(
1752
+            self, ctxt, node_uuid, port, instance_uuid, placement):
1753
+        """Return the extra allocation the instance needs due to the given
1754
+        port.
1755
+
1756
+        :param ctxt: nova.context.RequestContext
1757
+        :param node_uuid: the ComputeNode uuid the instance is running on.
1758
+        :param port: the port dict returned from neutron
1759
+        :param instance_uuid: The uuid of the instance the port is bound to
1760
+        :param placement: nova.scheduler.client.report.SchedulerReportClient
1761
+            to communicate with the Placement service API.
1762
+        :raise PlacementAPIConnectFailure: if placement API cannot be reached
1763
+        :raise ResourceProviderRetrievalFailed: compute node resource provider
1764
+            does not exist.
1765
+        :raise ResourceProviderTraitRetrievalFailed: if resource provider
1766
+            trait information cannot be read from placement.
1767
+        :raise MoreThanOneResourceProviderToHealFrom: if it cannot be decided
1768
+            unambiguously which resource provider to heal from.
1769
+        :raise NoResourceProviderToHealFrom: if there is no resource provider
1770
+            found to heal from.
1771
+        :return: A dict of resources keyed by RP uuid to be included in the
1772
+            instance allocation dict.
1773
+        """
1774
+        matching_rp_uuids = self._get_rps_in_tree_with_required_traits(
1775
+            ctxt, node_uuid, port['resource_request']['required'], placement)
1776
+
1777
+        if len(matching_rp_uuids) > 1:
1778
+            # If there is more than one such RP then it is an ambiguous
1779
+            # situation that we cannot handle here efficiently because that
1780
+            # would require the reimplementation of most of the allocation
1781
+            # candidate query functionality of placement. Also if more
1782
+            # than one such RP exists then selecting the right one might
1783
+            # need extra information from the compute node. For example
1784
+            # which PCI PF the VF is allocated from and which RP represents
1785
+            # that PCI PF in placement. When migration is supported with such
1786
+            # servers then we can ask the admin to migrate these servers
1787
+            # instead to heal their allocation.
1788
+            raise exception.MoreThanOneResourceProviderToHealFrom(
1789
+                rp_uuids=','.join(matching_rp_uuids),
1790
+                port_id=port['id'],
1791
+                instance_uuid=instance_uuid)
1792
+
1793
+        if len(matching_rp_uuids) == 0:
1794
+            raise exception.NoResourceProviderToHealFrom(
1795
+                port_id=port['id'],
1796
+                instance_uuid=instance_uuid,
1797
+                traits=port['resource_request']['required'],
1798
+                node_uuid=node_uuid)
1799
+
1800
+        # We found one RP that matches the traits. Assume that we can allocate
1801
+        # the resources from it. If there is not enough inventory left on the
1802
+        # RP then the PUT /allocations placement call will detect that.
1803
+        rp_uuid = matching_rp_uuids[0]
1804
+
1805
+        port_allocation = {
1806
+            rp_uuid: {
1807
+                'resources': port['resource_request']['resources']
1808
+            }
1809
+        }
1810
+        return port_allocation
1811
+
1812
+    def _get_port_allocations_to_heal(
1813
+            self, ctxt, instance, node_cache, placement, neutron, output):
1814
+        """Return the needed extra allocation for the ports of the instance.
1815
+
1816
+        :param ctxt: nova.context.RequestContext
1817
+        :param instance: instance to get the port allocations for
1818
+        :param node_cache: dict of Instance.node keys to ComputeNode.uuid
1819
+            values; this cache is updated if a new node is processed.
1820
+        :param placement: nova.scheduler.client.report.SchedulerReportClient
1821
+            to communicate with the Placement service API.
1822
+        :param neutron: nova.network.neutronv2.api.ClientWrapper to
1823
+            communicate with Neutron
1824
+        :param output: function that takes a single message for verbose output
1825
+        :raise UnableToQueryPorts: If the neutron list ports query fails.
1826
+        :raise nova.exception.ComputeHostNotFound: if compute node of the
1827
+            instance not found in the db.
1828
+        :raise PlacementAPIConnectFailure: if placement API cannot be reached
1829
+        :raise ResourceProviderRetrievalFailed: if the resource provider
1830
+            representing the compute node the instance is running on does not
1831
+            exist.
1832
+        :raise ResourceProviderTraitRetrievalFailed: if resource provider
1833
+            trait information cannot be read from placement.
1834
+        :raise MoreThanOneResourceProviderToHealFrom: if it cannot be decided
1835
+            unambiguously which resource provider to heal from.
1836
+        :raise NoResourceProviderToHealFrom: if there is no resource provider
1837
+            found to heal from.
1838
+        :return: A two tuple where the first item is a dict of resources keyed
1839
+            by RP uuid to be included in the instance allocation dict. The
1840
+            second item is a list of port dicts to be updated in Neutron.
1841
+        """
1842
+        # We need to heal port allocations for ports that have resource_request
1843
+        # but do not have an RP uuid in the binding:profile.allocation field.
1844
+        # We cannot use the instance info_cache to check the binding profile
1845
+        # as this code needs to be able to handle ports that were attached
1846
+        # before nova in stein started updating the allocation key in the
1847
+        # binding:profile.
1848
+        # In theory a port can be assigned to an instance without it being
1849
+        # bound to any host (e.g. in case of shelve offload) but
1850
+        # _heal_allocations_for_instance() already filters out instances that
1851
+        # are not on any host.
1852
+        ports_to_heal = [
1853
+            port for port in self._get_ports(ctxt, instance, neutron)
1854
+            if self._has_request_but_no_allocation(port)]
1855
+
1856
+        if not ports_to_heal:
1857
+            # nothing to do, return early
1858
+            return {}, []
1859
+
1860
+        node_uuid = self._get_compute_node_uuid(
1861
+            ctxt, instance, node_cache)
1862
+
1863
+        allocations = {}
1864
+        for port in ports_to_heal:
1865
+            port_allocation = self._get_port_allocation(
1866
+                ctxt, node_uuid, port, instance.uuid, placement)
1867
+            rp_uuid = list(port_allocation)[0]
1868
+            allocations = self._merge_allocations(
1869
+                allocations, port_allocation)
1870
+            # We also need to record the RP we are allocated from in the
1871
+            # port. This will be sent back to Neutron before the allocation
1872
+            # is updated in placement
1873
+            binding_profile = port.get('binding:profile', {}) or {}
1874
+            binding_profile['allocation'] = rp_uuid
1875
+            port['binding:profile'] = binding_profile
1876
+
1877
+            output(_("Found resource provider %(rp_uuid)s having matching "
1878
+                     "traits for port %(port_uuid)s with resource request "
1879
+                     "%(request)s attached to instance %(instance_uuid)s") %
1880
+                     {"rp_uuid": rp_uuid, "port_uuid": port["id"],
1881
+                      "request": port.get("resource_request"),
1882
+                      "instance_uuid": instance.uuid})
1883
+
1884
+        return allocations, ports_to_heal
1885
+
1886
+    def _update_ports(self, neutron, ports_to_update, output):
1887
+        succeeded = []
1888
+        try:
1889
+            for port in ports_to_update:
1890
+                body = {
1891
+                    'port': {
1892
+                        'binding:profile': port['binding:profile']
1893
+                    }
1894
+                }
1895
+                output(
1896
+                    _('Updating port %(port_uuid)s with attributes '
1897
+                      '%(attributes)s') %
1898
+                      {'port_uuid': port['id'], 'attributes': body['port']})
1899
+                neutron.update_port(port['id'], body=body)
1900
+                succeeded.append(port)
1901
+        except neutron_client_exc.NeutronClientException as e:
1902
+            output(
1903
+                _('Updating port %(port_uuid)s failed: %(error)s') %
1904
+                {'port_uuid': port['id'], 'error': six.text_type(e)})
1905
+            # one of the port updates failed. We need to roll back the updates
1906
+            # that succeeded before
1907
+            self._rollback_port_updates(neutron, succeeded, output)
1908
+            # we failed to heal so we need to stop but we successfully rolled
1909
+            # back the partial updates so the admin can retry the healing.
1910
+            raise exception.UnableToUpdatePorts(error=six.text_type(e))
1911
+
1912
+    @staticmethod
1913
+    def _rollback_port_updates(neutron, ports_to_rollback, output):
1914
+        # _update_ports() added the allocation key to these ports, so we need
1915
+        # to remove them during the rollback.
1916
+        manual_rollback_needed = []
1917
+        last_exc = None
1918
+        for port in ports_to_rollback:
1919
+            profile = port['binding:profile']
1920
+            profile.pop('allocation')
1921
+            body = {
1922
+                'port': {
1923
+                    'binding:profile': profile
1924
+                }
1925
+            }
1926
+            try:
1927
+                output(_('Rolling back port update for %(port_uuid)s') %
1928
+                       {'port_uuid': port['id']})
1929
+                neutron.update_port(port['id'], body=body)
1930
+            except neutron_client_exc.NeutronClientException as e:
1931
+                output(
1932
+                    _('Rolling back update for port %(port_uuid)s failed: '
1933
+                      '%(error)s') % {'port_uuid': port['id'],
1934
+                                      'error': six.text_type(e)})
1935
+                # TODO(gibi): We could implement a retry mechanism with
1936
+                # back off.
1937
+                manual_rollback_needed.append(port['id'])
1938
+                last_exc = e
1939
+
1940
+        if manual_rollback_needed:
1941
+            # At least one of the port operation failed so we failed to roll
1942
+            # back. There are partial updates in neutron. Human intervention
1943
+            # needed.
1944
+            raise exception.UnableToRollbackPortUpdates(
1945
+                error=six.text_type(last_exc),
1946
+                port_uuids=manual_rollback_needed)
1947
+
1661 1948
     def _heal_missing_alloc(self, ctxt, instance, node_cache):
1662 1949
         node_uuid = self._get_compute_node_uuid(
1663 1950
             ctxt, instance, node_cache)
@@ -1683,18 +1970,23 @@ class PlacementCommands(object):
1683 1970
         return allocations
1684 1971
 
1685 1972
     def _heal_allocations_for_instance(self, ctxt, instance, node_cache,
1686
-                                       output, placement, dry_run):
1973
+                                       output, placement, dry_run,
1974
+                                       heal_port_allocations, neutron):
1687 1975
         """Checks the given instance to see if it needs allocation healing
1688 1976
 
1689 1977
         :param ctxt: cell-targeted nova.context.RequestContext
1690 1978
         :param instance: the instance to check for allocation healing
1691 1979
         :param node_cache: dict of Instance.node keys to ComputeNode.uuid
1692 1980
             values; this cache is updated if a new node is processed.
1693
-        :param outout: function that takes a single message for verbose output
1981
+        :param output: function that takes a single message for verbose output
1694 1982
         :param placement: nova.scheduler.client.report.SchedulerReportClient
1695 1983
             to communicate with the Placement service API.
1696 1984
         :param dry_run: Process instances and print output but do not commit
1697 1985
             any changes.
1986
+        :param heal_port_allocations: True if healing port allocation is
1987
+            requested, False otherwise.
1988
+        :param neutron: nova.network.neutronv2.api.ClientWrapper to
1989
+            communicate with Neutron
1698 1990
         :return: True if allocations were created or updated for the instance,
1699 1991
             None if nothing needed to be done
1700 1992
         :raises: nova.exception.ComputeHostNotFound if a compute node for a
@@ -1703,6 +1995,21 @@ class PlacementCommands(object):
1703 1995
             a given instance against a given compute node resource provider
1704 1996
         :raises: AllocationUpdateFailed if unable to update allocations for
1705 1997
             a given instance with consumer project/user information
1998
+        :raise UnableToQueryPorts: If the neutron list ports query fails.
1999
+        :raise PlacementAPIConnectFailure: if placement API cannot be reached
2000
+        :raise ResourceProviderRetrievalFailed: if the resource provider
2001
+            representing the compute node the instance is running on does not
2002
+            exist.
2003
+        :raise ResourceProviderTraitRetrievalFailed: if resource provider
2004
+            trait information cannot be read from placement.
2005
+        :raise MoreThanOneResourceProviderToHealFrom: if it cannot be decided
2006
+            unambiguously which resource provider to heal from.
2007
+        :raise NoResourceProviderToHealFrom: if there is no resource provider
2008
+            found to heal from.
2009
+        :raise UnableToUpdatePorts: if a port update failed in neutron but any
2010
+            partial update was rolled back successfully.
2011
+        :raise UnableToRollbackPortUpdates: if a port update failed in neutron
2012
+            and the rollback of the partial updates also failed.
1706 2013
         """
1707 2014
         if instance.task_state is not None:
1708 2015
             output(_('Instance %(instance)s is undergoing a task '
@@ -1744,6 +2051,19 @@ class PlacementCommands(object):
1744 2051
             allocations = self._heal_missing_project_and_user_id(
1745 2052
                 allocations, instance)
1746 2053
 
2054
+        if heal_port_allocations:
2055
+            to_heal = self._get_port_allocations_to_heal(
2056
+                ctxt, instance, node_cache, placement, neutron, output)
2057
+            port_allocations, ports_to_update = to_heal
2058
+        else:
2059
+            port_allocations, ports_to_update = {}, []
2060
+
2061
+        if port_allocations:
2062
+            need_healing = need_healing or 'Update'
2063
+            # Merge in any missing port allocations
2064
+            allocations['allocations'] = self._merge_allocations(
2065
+                allocations['allocations'], port_allocations)
2066
+
1747 2067
         if need_healing:
1748 2068
             if dry_run:
1749 2069
                 output(_('[dry-run] %(operation)s allocations for instance '
@@ -1752,6 +2072,16 @@ class PlacementCommands(object):
1752 2072
                         'instance': instance.uuid,
1753 2073
                         'allocations': allocations})
1754 2074
             else:
2075
+                # First update ports in neutron. If any of those operations
2076
+                # fail, then roll back the successful part of it and fail the
2077
+                # healing. We do this first because rolling back the port
2078
+                # updates is more straight-forward than rolling back allocation
2079
+                # changes.
2080
+                self._update_ports(neutron, ports_to_update, output)
2081
+
2082
+                # Now that neutron update succeeded we can try to update
2083
+                # placement. If it fails we need to rollback every neutron port
2084
+                # update done before.
1755 2085
                 resp = placement.put_allocations(ctxt, instance.uuid,
1756 2086
                                                  allocations)
1757 2087
                 if resp:
@@ -1761,15 +2091,24 @@ class PlacementCommands(object):
1761 2091
                             'instance': instance.uuid})
1762 2092
                     return True
1763 2093
                 else:
2094
+                    # Rollback every neutron update. If we succeed to
2095
+                    # roll back then it is safe to stop here and let the admin
2096
+                    # retry. If the rollback fails then
2097
+                    # _rollback_port_updates() will raise another exception
2098
+                    # that instructs the operator how to clean up manually
2099
+                    # before the healing can be retried
2100
+                    self._rollback_port_updates(
2101
+                        neutron, ports_to_update, output)
1764 2102
                     raise exception.AllocationUpdateFailed(
1765 2103
                         consumer_uuid=instance.uuid, error='')
1766 2104
         else:
1767
-            output(_('Instance %s already has allocations with '
1768
-                     'matching consumer project/user.') % instance.uuid)
2105
+            output(_('The allocation of instance %s is up-to-date. '
2106
+                     'Nothing to be healed.') % instance.uuid)
1769 2107
             return
1770 2108
 
1771 2109
     def _heal_instances_in_cell(self, ctxt, max_count, unlimited, output,
1772
-                                placement, dry_run, instance_uuid):
2110
+                                placement, dry_run, instance_uuid,
2111
+                                heal_port_allocations, neutron):
1773 2112
         """Checks for instances to heal in a given cell.
1774 2113
 
1775 2114
         :param ctxt: cell-targeted nova.context.RequestContext
@@ -1782,6 +2121,10 @@ class PlacementCommands(object):
1782 2121
         :param dry_run: Process instances and print output but do not commit
1783 2122
             any changes.
1784 2123
         :param instance_uuid: UUID of a specific instance to process.
2124
+        :param heal_port_allocations: True if healing port allocation is
2125
+            requested, False otherwise.
2126
+        :param neutron: nova.network.neutronv2.api.ClientWrapper to
2127
+            communicate with Neutron
1785 2128
         :return: Number of instances that had allocations created.
1786 2129
         :raises: nova.exception.ComputeHostNotFound if a compute node for a
1787 2130
             given instance cannot be found
@@ -1789,6 +2132,21 @@ class PlacementCommands(object):
1789 2132
             a given instance against a given compute node resource provider
1790 2133
         :raises: AllocationUpdateFailed if unable to update allocations for
1791 2134
             a given instance with consumer project/user information
2135
+        :raise UnableToQueryPorts: If the neutron list ports query fails.
2136
+        :raise PlacementAPIConnectFailure: if placement API cannot be reached
2137
+        :raise ResourceProviderRetrievalFailed: if the resource provider
2138
+            representing the compute node the instance is running on does not
2139
+            exist.
2140
+        :raise ResourceProviderTraitRetrievalFailed: if resource provider
2141
+            trait information cannot be read from placement.
2142
+        :raise MoreThanOneResourceProviderToHealFrom: if it cannot be decided
2143
+            unambiguously which resource provider to heal from.
2144
+        :raise NoResourceProviderToHealFrom: if there is no resource provider
2145
+            found to heal from.
2146
+        :raise UnableToUpdatePorts: if a port update failed in neutron but any
2147
+            partial update was rolled back successfully.
2148
+        :raise UnableToRollbackPortUpdates: if a port update failed in neutron
2149
+            and the rollback of the partial updates also failed.
1792 2150
         """
1793 2151
         # Keep a cache of instance.node to compute node resource provider UUID.
1794 2152
         # This will save some queries for non-ironic instances to the
@@ -1820,7 +2178,7 @@ class PlacementCommands(object):
1820 2178
             for instance in instances:
1821 2179
                 if self._heal_allocations_for_instance(
1822 2180
                         ctxt, instance, node_cache, output, placement,
1823
-                        dry_run):
2181
+                        dry_run, heal_port_allocations, neutron):
1824 2182
                     num_processed += 1
1825 2183
 
1826 2184
             # Make sure we don't go over the max count. Note that we
@@ -1843,7 +2201,8 @@ class PlacementCommands(object):
1843 2201
     @action_description(
1844 2202
         _("Iterates over non-cell0 cells looking for instances which do "
1845 2203
           "not have allocations in the Placement service, or have incomplete "
1846
-          "consumer project_id/user_id values in existing allocations, and "
2204
+          "consumer project_id/user_id values in existing allocations or "
2205
+          "missing allocations for ports having resource request, and "
1847 2206
           "which are not undergoing a task state transition. For each "
1848 2207
           "instance found, allocations are created (or updated) against the "
1849 2208
           "compute node resource provider for that instance based on the "
@@ -1864,8 +2223,16 @@ class PlacementCommands(object):
1864 2223
     @args('--instance', metavar='<instance_uuid>', dest='instance_uuid',
1865 2224
           help='UUID of a specific instance to process. If specified '
1866 2225
                '--max-count has no effect.')
2226
+    @args('--skip-port-allocations', action='store_true',
2227
+          dest='skip_port_allocations', default=False,
2228
+          help='Skip the healing of the resource allocations of bound ports. '
2229
+               'E.g. healing bandwidth resource allocation for ports having '
2230
+               'minimum QoS policy rules attached. If your deployment does '
2231
+               'not use such a feature then the performance impact of '
2232
+               'querying neutron ports for each instance can be avoided with '
2233
+               'this flag.')
1867 2234
     def heal_allocations(self, max_count=None, verbose=False, dry_run=False,
1868
-                         instance_uuid=None):
2235
+                         instance_uuid=None, skip_port_allocations=False):
1869 2236
         """Heals instance allocations in the Placement service
1870 2237
 
1871 2238
         Return codes:
@@ -1876,6 +2243,9 @@ class PlacementCommands(object):
1876 2243
         * 3: Unable to create (or update) allocations for an instance against
1877 2244
              its compute node resource provider.
1878 2245
         * 4: Command completed successfully but no allocations were created.
2246
+        * 5: Unable to query ports from neutron
2247
+        * 6: Unable to update ports in neutron
2248
+        * 7: Cannot roll back neutron port updates. Manual steps needed.
1879 2249
         * 127: Invalid input.
1880 2250
         """
1881 2251
         # NOTE(mriedem): Thoughts on ways to expand this:
@@ -1891,6 +2261,8 @@ class PlacementCommands(object):
1891 2261
         #   would probably only be safe with a specific instance.
1892 2262
         # - deal with nested resource providers?
1893 2263
 
2264
+        heal_port_allocations = not skip_port_allocations
2265
+
1894 2266
         output = lambda msg: None
1895 2267
         if verbose:
1896 2268
             output = lambda msg: print(msg)
@@ -1937,6 +2309,11 @@ class PlacementCommands(object):
1937 2309
                 return 4
1938 2310
 
1939 2311
         placement = report.SchedulerReportClient()
2312
+
2313
+        neutron = None
2314
+        if heal_port_allocations:
2315
+            neutron = neutron_api.get_client(ctxt, admin=True)
2316
+
1940 2317
         num_processed = 0
1941 2318
         # TODO(mriedem): Use context.scatter_gather_skip_cell0.
1942 2319
         for cell in cells:
@@ -1957,14 +2334,28 @@ class PlacementCommands(object):
1957 2334
                 try:
1958 2335
                     num_processed += self._heal_instances_in_cell(
1959 2336
                         cctxt, limit_per_cell, unlimited, output, placement,
1960
-                        dry_run, instance_uuid)
2337
+                        dry_run, instance_uuid, heal_port_allocations, neutron)
1961 2338
                 except exception.ComputeHostNotFound as e:
1962 2339
                     print(e.format_message())
1963 2340
                     return 2
1964 2341
                 except (exception.AllocationCreateFailed,
1965
-                        exception.AllocationUpdateFailed) as e:
2342
+                        exception.AllocationUpdateFailed,
2343
+                        exception.NoResourceProviderToHealFrom,
2344
+                        exception.MoreThanOneResourceProviderToHealFrom,
2345
+                        exception.PlacementAPIConnectFailure,
2346
+                        exception.ResourceProviderRetrievalFailed,
2347
+                        exception.ResourceProviderTraitRetrievalFailed) as e:
1966 2348
                     print(e.format_message())
1967 2349
                     return 3
2350
+                except exception.UnableToQueryPorts as e:
2351
+                    print(e.format_message())
2352
+                    return 5
2353
+                except exception.UnableToUpdatePorts as e:
2354
+                    print(e.format_message())
2355
+                    return 6
2356
+                except exception.UnableToRollbackPortUpdates as e:
2357
+                    print(e.format_message())
2358
+                    return 7
1968 2359
 
1969 2360
                 # Make sure we don't go over the max count. Note that we
1970 2361
                 # don't include instances that already have allocations in the

+ 50
- 0
nova/exception.py

@@ -2440,3 +2440,53 @@ class ReshapeFailed(NovaException):
2440 2440
 class ReshapeNeeded(NovaException):
2441 2441
     msg_fmt = _("Virt driver indicates that provider inventories need to be "
2442 2442
                 "moved.")
2443
+
2444
+
2445
+class HealPortAllocationException(NovaException):
2446
+    msg_fmt = _("Healing port allocation failed.")
2447
+
2448
+
2449
+class MoreThanOneResourceProviderToHealFrom(HealPortAllocationException):
2450
+    msg_fmt = _("More than one matching resource provider %(rp_uuids)s is "
2451
+                "available for healing the port allocation for port "
2452
+                "%(port_id)s for instance %(instance_uuid)s. This script "
2453
+                "does not have enough information to select the proper "
2454
+                "resource provider from which to heal.")
2455
+
2456
+
2457
+class NoResourceProviderToHealFrom(HealPortAllocationException):
2458
+    msg_fmt = _("No matching resource provider is "
2459
+                "available for healing the port allocation for port "
2460
+                "%(port_id)s for instance %(instance_uuid)s. There are no "
2461
+                "resource providers with matching traits %(traits)s in the "
2462
+                "provider tree of the resource provider %(node_uuid)s ."
2463
+                "This probably means that the neutron QoS configuration is "
2464
+                "wrong. Consult with "
2465
+                "https://docs.openstack.org/neutron/latest/admin/"
2466
+                "config-qos-min-bw.html for information on how to configure "
2467
+                "neutron. If the configuration is fixed the script can be run "
2468
+                "again.")
2469
+
2470
+
2471
+class UnableToQueryPorts(HealPortAllocationException):
2472
+    msg_fmt = _("Unable to query ports for instance %(instance_uuid)s: "
2473
+                "%(error)s")
2474
+
2475
+
2476
+class UnableToUpdatePorts(HealPortAllocationException):
2477
+    msg_fmt = _("Unable to update ports with allocations that are about to be "
2478
+                "created in placement: %(error)s. The healing of the "
2479
+                "instance is aborted. It is safe to try to heal the instance "
2480
+                "again.")
2481
+
2482
+
2483
+class UnableToRollbackPortUpdates(HealPortAllocationException):
2484
+    msg_fmt = _("Failed to update neutron ports with allocation keys and the "
2485
+                "automatic rollback of the previously successful port updates "
2486
+                "also failed: %(error)s. Make sure that the "
2487
+                "binding:profile.allocation key of the affected ports "
2488
+                "%(port_uuids)s are manually cleaned in neutron according to "
2489
+                "document https://docs.openstack.org/nova/latest/cli/"
2490
+                "nova-manage.html#placement. If you re-run the script without "
2491
+                "the manual fix then the missing allocation for these ports "
2492
+                "will not be healed in placement.")

+ 3
- 1
nova/tests/fixtures.py

@@ -1601,7 +1601,9 @@ class NeutronFixture(fixtures.Fixture):
1601 1601
 
1602 1602
     def update_port(self, port_id, body=None):
1603 1603
         port = self._ports[port_id]
1604
-        port.update(body['port'])
1604
+        # We need to deepcopy here as well as the body can have a nested dict
1605
+        # which can be modified by the caller after this update_port call
1606
+        port.update(copy.deepcopy(body['port']))
1605 1607
         return {'port': copy.deepcopy(port)}
1606 1608
 
1607 1609
     def show_quota(self, project_id):

+ 550
- 3
nova/tests/functional/test_nova_manage.py

@@ -11,18 +11,24 @@
11 11
 #    under the License.
12 12
 from __future__ import absolute_import
13 13
 
14
+import collections
14 15
 import mock
15 16
 
16 17
 import fixtures
18
+from neutronclient.common import exceptions as neutron_client_exc
19
+import os_resource_classes as orc
20
+from oslo_utils.fixture import uuidsentinel
17 21
 from six.moves import StringIO
18 22
 
19 23
 from nova.cmd import manage
20 24
 from nova import config
21 25
 from nova import context
26
+from nova import exception
22 27
 from nova import objects
23 28
 from nova import test
24 29
 from nova.tests import fixtures as nova_fixtures
25 30
 from nova.tests.functional import integrated_helpers
31
+from nova.tests.functional import test_servers
26 32
 
27 33
 CONF = config.CONF
28 34
 INCOMPLETE_CONSUMER_ID = '00000000-0000-0000-0000-000000000000'
@@ -474,9 +480,9 @@ class TestNovaManagePlacementHealAllocations(
474 480
             self.assertIn('Max count reached. Processed 1 instances.', output)
475 481
             # If this is the 2nd call, we'll have skipped the first instance.
476 482
             if x == 0:
477
-                self.assertNotIn('already has allocations', output)
483
+                self.assertNotIn('is up-to-date', output)
478 484
             else:
479
-                self.assertIn('already has allocations', output)
485
+                self.assertIn('is up-to-date', output)
480 486
 
481 487
         self._assert_healed(server1, rp_uuid1)
482 488
         self._assert_healed(server2, rp_uuid2)
@@ -484,7 +490,7 @@ class TestNovaManagePlacementHealAllocations(
484 490
         # run it again to make sure nothing was processed
485 491
         result = self.cli.heal_allocations(verbose=True)
486 492
         self.assertEqual(4, result, self.output.getvalue())
487
-        self.assertIn('already has allocations', self.output.getvalue())
493
+        self.assertIn('is up-to-date', self.output.getvalue())
488 494
 
489 495
     def test_heal_allocations_paging_max_count_more_than_num_instances(self):
490 496
         """Sets up 2 instances in cell1 and 1 instance in cell2. Then specify
@@ -741,6 +747,547 @@ class TestNovaManagePlacementHealAllocations(
741 747
                       server['id'], output)
742 748
 
743 749
 
750
+class TestNovaManagePlacementHealPortAllocations(
751
+        test_servers.PortResourceRequestBasedSchedulingTestBase):
752
+
753
+    def setUp(self):
754
+        super(TestNovaManagePlacementHealPortAllocations, self).setUp()
755
+        self.cli = manage.PlacementCommands()
756
+        self.flavor = self.api.get_flavors()[0]
757
+        self.output = StringIO()
758
+        self.useFixture(fixtures.MonkeyPatch('sys.stdout', self.output))
759
+
760
+        # Make it easier to debug failed test cases
761
+        def print_stdout_on_fail(*args, **kwargs):
762
+            import sys
763
+            sys.stderr.write(self.output.getvalue())
764
+
765
+        self.addOnException(print_stdout_on_fail)
766
+
767
+    def _add_resource_request_to_a_bound_port(self, port_id, resource_request):
768
+        # NOTE(gibi): self.neutron._ports contains a copy of each neutron port
769
+        # defined on class level in the fixture. So modifying what is in the
770
+        # _ports list is safe as it is re-created for each Neutron fixture
771
+        # instance therefore for each individual test using that fixture.
772
+        bound_port = self.neutron._ports[port_id]
773
+        bound_port['resource_request'] = resource_request
774
+
775
+    def _create_server_with_missing_port_alloc(
776
+            self, ports, resource_request=None):
777
+        if not resource_request:
778
+            resource_request = {
779
+                "resources": {
780
+                    orc.NET_BW_IGR_KILOBIT_PER_SEC: 1000,
781
+                    orc.NET_BW_EGR_KILOBIT_PER_SEC: 1000},
782
+                "required": ["CUSTOM_PHYSNET2", "CUSTOM_VNIC_TYPE_NORMAL"]
783
+            }
784
+
785
+        server = self._create_server(
786
+            flavor=self.flavor,
787
+            networks=[{'port': port['id']} for port in ports])
788
+        server = self._wait_for_state_change(self.admin_api, server, 'ACTIVE')
789
+
790
+        # This is a hack to simulate that we have a server that is missing
791
+        # allocation for its port
792
+        for port in ports:
793
+            self._add_resource_request_to_a_bound_port(
794
+                port['id'], resource_request)
795
+
796
+        updated_ports = [
797
+            self.neutron.show_port(port['id'])['port'] for port in ports]
798
+
799
+        return server, updated_ports
800
+
801
+    def _assert_placement_updated(self, server, ports):
802
+        rsp = self.placement_api.get(
803
+            '/allocations/%s' % server['id'],
804
+            version=1.28).body
805
+
806
+        allocations = rsp['allocations']
807
+
808
+        # we expect one allocation for the compute resources and one for the
809
+        # networking resources
810
+        self.assertEqual(2, len(allocations))
811
+        self.assertEqual(
812
+            self._resources_from_flavor(self.flavor),
813
+            allocations[self.compute1_rp_uuid]['resources'])
814
+
815
+        self.assertEqual(server['tenant_id'], rsp['project_id'])
816
+        self.assertEqual(server['user_id'], rsp['user_id'])
817
+
818
+        network_allocations = allocations[
819
+            self.ovs_bridge_rp_per_host[self.compute1_rp_uuid]]['resources']
820
+
821
+        # this code assumes that every port is allocated from the same OVS
822
+        # bridge RP
823
+        total_request = collections.defaultdict(int)
824
+        for port in ports:
825
+            port_request = port['resource_request']['resources']
826
+            for rc, amount in port_request.items():
827
+                total_request[rc] += amount
828
+        self.assertEqual(total_request, network_allocations)
829
+
830
+    def _assert_port_updated(self, port_uuid):
831
+        updated_port = self.neutron.show_port(port_uuid)['port']
832
+        binding_profile = updated_port.get('binding:profile', {})
833
+        self.assertEqual(
834
+            self.ovs_bridge_rp_per_host[self.compute1_rp_uuid],
835
+            binding_profile['allocation'])
836
+
837
+    def _assert_ports_updated(self, ports):
838
+        for port in ports:
839
+            self._assert_port_updated(port['id'])
840
+
841
+    def _assert_placement_not_updated(self, server):
842
+        allocations = self.placement_api.get(
843
+            '/allocations/%s' % server['id']).body['allocations']
844
+        self.assertEqual(1, len(allocations))
845
+        self.assertIn(self.compute1_rp_uuid, allocations)
846
+
847
+    def _assert_port_not_updated(self, port_uuid):
848
+        updated_port = self.neutron.show_port(port_uuid)['port']
849
+        binding_profile = updated_port.get('binding:profile', {})
850
+        self.assertNotIn('allocation', binding_profile)
851
+
852
+    def _assert_ports_not_updated(self, ports):
853
+        for port in ports:
854
+            self._assert_port_not_updated(port['id'])
855
+
856
+    def test_heal_port_allocation_only(self):
857
+        """Test that only port allocation needs to be healed for an instance.
858
+
859
+        * boot with a neutron port that does not have resource request
860
+        * hack in a resource request for the bound port
861
+        * heal the allocation
862
+        * check if the port allocation is created in placement and the port
863
+          is updated in neutron
864
+
865
+        """
866
+        server, ports = self._create_server_with_missing_port_alloc(
867
+            [self.neutron.port_1])
868
+
869
+        # let's trigger a heal
870
+        result = self.cli.heal_allocations(verbose=True, max_count=2)
871
+
872
+        self._assert_placement_updated(server, ports)
873
+        self._assert_ports_updated(ports)
874
+
875
+        self.assertIn(
876
+            'Successfully updated allocations',
877
+            self.output.getvalue())
878
+        self.assertEqual(0, result)
879
+
880
+    def test_no_healing_is_needed(self):
881
+        """Test that the instance has a port that has allocations
882
+        so nothing to be healed.
883
+        """
884
+        server, ports = self._create_server_with_missing_port_alloc(
885
+            [self.neutron.port_1])
886
+
887
+        # heal it once
888
+        result = self.cli.heal_allocations(verbose=True, max_count=2)
889
+
890
+        self._assert_placement_updated(server, ports)
891
+        self._assert_ports_updated(ports)
892
+
893
+        self.assertIn(
894
+            'Successfully updated allocations',
895
+            self.output.getvalue())
896
+        self.assertEqual(0, result)
897
+
898
+        # try to heal it again
899
+        result = self.cli.heal_allocations(verbose=True, max_count=2)
900
+
901
+        # nothing is removed
902
+        self._assert_placement_updated(server, ports)
903
+        self._assert_ports_updated(ports)
904
+
905
+        # healing was not needed
906
+        self.assertIn(
907
+            'Nothing to be healed.',
908
+            self.output.getvalue())
909
+        self.assertEqual(4, result)
910
+
911
+    def test_skip_heal_port_allocation(self):
912
+        """Test that only port allocation needs to be healed for an instance
913
+        but port healing is skipped on the cli.
914
+        """
915
+        server, ports = self._create_server_with_missing_port_alloc(
916
+            [self.neutron.port_1])
917
+
918
+        # let's trigger a heal
919
+        result = self.cli.heal_allocations(
920
+            verbose=True, max_count=2, skip_port_allocations=True)
921
+
922
+        self._assert_placement_not_updated(server)
923
+        self._assert_ports_not_updated(ports)
924
+
925
+        output = self.output.getvalue()
926
+        self.assertNotIn('Updating port', output)
927
+        self.assertIn('Nothing to be healed', output)
928
+        self.assertEqual(4, result)
929
+
930
+    def test_skip_heal_port_allocation_but_heal_the_rest(self):
931
+        """Test that the instance doesn't have allocation at all, needs
932
+        allocation for ports as well, but only heal the non port related
933
+        allocation.
934
+        """
935
+        server, ports = self._create_server_with_missing_port_alloc(
936
+            [self.neutron.port_1])
937
+
938
+        # delete the server allocation in placement to simulate that it needs
939
+        # to be healed
940
+
941
+        # NOTE(gibi): putting empty allocation will delete the consumer in
942
+        # placement
943
+        allocations = self.placement_api.get(
944
+            '/allocations/%s' % server['id'], version=1.28).body
945
+        allocations['allocations'] = {}
946
+        self.placement_api.put(
947
+            '/allocations/%s' % server['id'], allocations, version=1.28)
948
+
949
+        # let's trigger a heal
950
+        result = self.cli.heal_allocations(
951
+            verbose=True, max_count=2, skip_port_allocations=True)
952
+
953
+        # this actually checks that the server has its non port related
954
+        # allocation in placement
955
+        self._assert_placement_not_updated(server)
956
+        self._assert_ports_not_updated(ports)
957
+
958
+        output = self.output.getvalue()
959
+        self.assertIn(
960
+            'Successfully created allocations for instance', output)
961
+        self.assertEqual(0, result)
962
+
963
+    def test_heal_port_allocation_and_project_id(self):
964
+        """Test that not just port allocation needs to be healed but also the
965
+        missing project_id and user_id.
966
+        """
967
+        server, ports = self._create_server_with_missing_port_alloc(
968
+            [self.neutron.port_1])
969
+
970
+        # override allocation with  placement microversion <1.8 to simulate
971
+        # missing project_id and user_id
972
+        alloc_body = {
973
+            "allocations": [
974
+                {
975
+                    "resource_provider": {
976
+                        "uuid": self.compute1_rp_uuid
977
+                    },
978
+                    "resources": {
979
+                        "MEMORY_MB": self.flavor['ram'],
980
+                        "VCPU": self.flavor['vcpus'],
981
+                        "DISK_GB": self.flavor['disk']
982
+                    }
983
+                }
984
+            ]
985
+        }
986
+        self.placement_api.put('/allocations/%s' % server['id'], alloc_body)
987
+
988
+        # let's trigger a heal
989
+        result = self.cli.heal_allocations(verbose=True, max_count=2)
990
+
991
+        self._assert_placement_updated(server, ports)
992
+        self._assert_ports_updated(ports)
993
+
994
+        output = self.output.getvalue()
995
+
996
+        self.assertIn(
997
+            'Successfully updated allocations for instance', output)
998
+        self.assertIn('Processed 1 instances.', output)
999
+
1000
+        self.assertEqual(0, result)
1001
+
1002
+    def test_heal_allocation_create_allocation_with_port_allocation(self):
1003
+        """Test that the instance doesn't have allocation at all but needs
1004
+        allocation for the ports as well.
1005
+        """
1006
+        server, ports = self._create_server_with_missing_port_alloc(
1007
+            [self.neutron.port_1])
1008
+
1009
+        # delete the server allocation in placement to simulate that it needs
1010
+        # to be healed
1011
+
1012
+        # NOTE(gibi): putting empty allocation will delete the consumer in
1013
+        # placement
1014
+        allocations = self.placement_api.get(
1015
+            '/allocations/%s' % server['id'], version=1.28).body
1016
+        allocations['allocations'] = {}
1017
+        self.placement_api.put(
1018
+            '/allocations/%s' % server['id'], allocations, version=1.28)
1019
+
1020
+        # let's trigger a heal
1021
+        result = self.cli.heal_allocations(verbose=True, max_count=2)
1022
+
1023
+        self._assert_placement_updated(server, ports)
1024
+        self._assert_ports_updated(ports)
1025
+
1026
+        output = self.output.getvalue()
1027
+        self.assertIn(
1028
+            'Successfully created allocations for instance', output)
1029
+        self.assertEqual(0, result)
1030
+
1031
+    def test_heal_port_allocation_not_enough_resources_for_port(self):
1032
+        """Test that a port needs allocation but not enough inventory
1033
+        available.
1034
+        """
1035
+        # The port will request too much NET_BW_IGR_KILOBIT_PER_SEC so there is
1036
+        # no RP on the host that can provide it.
1037
+        resource_request = {
1038
+            "resources": {
1039
+                orc.NET_BW_IGR_KILOBIT_PER_SEC: 100000000000,
1040
+                orc.NET_BW_EGR_KILOBIT_PER_SEC: 1000},
1041
+            "required": ["CUSTOM_PHYSNET2",
1042
+                         "CUSTOM_VNIC_TYPE_NORMAL"]
1043
+        }
1044
+        server, ports = self._create_server_with_missing_port_alloc(
1045
+            [self.neutron.port_1], resource_request)
1046
+
1047
+        # let's trigger a heal
1048
+        result = self.cli.heal_allocations(verbose=True, max_count=2)
1049
+
1050
+        self._assert_placement_not_updated(server)
1051
+        # The port was actually updated but the update was rolled back when
1052
+        # the placement update failed
1053
+        self._assert_ports_not_updated(ports)
1054
+
1055
+        output = self.output.getvalue()
1056
+        self.assertIn(
1057
+            'Rolling back port update',
1058
+            output)
1059
+        self.assertIn(
1060
+            'Failed to update allocations for consumer',
1061
+            output)
1062
+        self.assertEqual(3, result)
1063
+
1064
+    def test_heal_port_allocation_no_rp_providing_required_traits(self):
1065
+        """Test that a port needs allocation but no rp is providing the
1066
+        required traits.
1067
+        """
1068
+        # The port will request a trait, CUSTOM_PHYSNET_NONEXISTENT, that will
1069
+        # not be provided by any RP on this host
1070
+        resource_request = {
1071
+            "resources": {
1072
+                orc.NET_BW_IGR_KILOBIT_PER_SEC: 1000,
1073
+                orc.NET_BW_EGR_KILOBIT_PER_SEC: 1000},
1074
+            "required": ["CUSTOM_PHYSNET_NONEXISTENT",
1075
+                         "CUSTOM_VNIC_TYPE_NORMAL"]
1076
+        }
1077
+        server, ports = self._create_server_with_missing_port_alloc(
1078
+            [self.neutron.port_1], resource_request)
1079
+
1080
+        # let's trigger a heal
1081
+        result = self.cli.heal_allocations(verbose=True, max_count=2)
1082
+
1083
+        self._assert_placement_not_updated(server)
1084
+        self._assert_ports_not_updated(ports)
1085
+
1086
+        self.assertIn(
1087
+            'No matching resource provider is available for healing the port '
1088
+            'allocation',
1089
+            self.output.getvalue())
1090
+        self.assertEqual(3, result)
1091
+
1092
+    def test_heal_port_allocation_ambiguous_rps(self):
1093
+        """Test that there are more than one matching RPs are available on the
1094
+        compute.
1095
+        """
1096
+
1097
+        # The port will request the CUSTOM_VNIC_TYPE_DIRECT trait and there are
1098
+        # two RPs that support that trait.
1099
+        resource_request = {
1100
+            "resources": {
1101
+                orc.NET_BW_IGR_KILOBIT_PER_SEC: 1000,
1102
+                orc.NET_BW_EGR_KILOBIT_PER_SEC: 1000},
1103
+            "required": ["CUSTOM_PHYSNET2",
1104
+                         "CUSTOM_VNIC_TYPE_DIRECT"]
1105
+        }
1106
+        server, ports = self._create_server_with_missing_port_alloc(
1107
+            [self.neutron.port_1], resource_request)
1108
+
1109
+        # let's trigger a heal
1110
+        result = self.cli.heal_allocations(verbose=True, max_count=2)
1111
+
1112
+        self._assert_placement_not_updated(server)
1113
+        self._assert_ports_not_updated(ports)
1114
+
1115
+        self.assertIn(
1116
+            'More than one matching resource provider',
1117
+            self.output.getvalue())
1118
+        self.assertEqual(3, result)
1119
+
1120
+    def test_heal_port_allocation_neutron_unavailable_during_port_query(self):
1121
+        """Test that Neutron is not available when querying ports.
1122
+        """
1123
+        server, ports = self._create_server_with_missing_port_alloc(
1124
+            [self.neutron.port_1])
1125
+
1126
+        with mock.patch.object(
1127
+                self.neutron, "list_ports",
1128
+                side_effect=neutron_client_exc.Unauthorized()):
1129
+            # let's trigger a heal
1130
+            result = self.cli.heal_allocations(verbose=True, max_count=2)
1131
+
1132
+        self._assert_placement_not_updated(server)
1133
+        self._assert_ports_not_updated(ports)
1134
+
1135
+        self.assertIn(
1136
+            'Unable to query ports for instance',
1137
+            self.output.getvalue())
1138
+        self.assertEqual(5, result)
1139
+
1140
+    def test_heal_port_allocation_neutron_unavailable(self):
1141
+        """Test that the port cannot be updated in Neutron with RP uuid as
1142
+        Neutron is unavailable.
1143
+        """
1144
+        server, ports = self._create_server_with_missing_port_alloc(
1145
+            [self.neutron.port_1])
1146
+
1147
+        with mock.patch.object(
1148
+                self.neutron, "update_port",
1149
+                side_effect=neutron_client_exc.Forbidden()):
1150
+            # let's trigger a heal
1151
+            result = self.cli.heal_allocations(verbose=True, max_count=2)
1152
+
1153
+        self._assert_placement_not_updated(server)
1154
+        self._assert_ports_not_updated(ports)
1155
+
1156
+        self.assertIn(
1157
+            'Unable to update ports with allocations',
1158
+            self.output.getvalue())
1159
+        self.assertEqual(6, result)
1160
+
1161
+    def test_heal_multiple_port_allocations_rollback_success(self):
1162
+        """Test neutron port update rollback happy case. Try to heal two ports
1163
+        and make the second port update fail in neutron. Assert that the
1164
+        first port update is rolled back successfully.
1165
+        """
1166
+        port2 = self.neutron.create_port()['port']
1167
+        server, ports = self._create_server_with_missing_port_alloc(
1168
+            [self.neutron.port_1, port2])
1169
+
1170
+        orig_update_port = self.neutron.update_port
1171
+        update = []
1172
+
1173
+        def fake_update_port(*args, **kwargs):
1174
+            if len(update) == 0 or len(update) > 1:
1175
+                update.append(True)
1176
+                return orig_update_port(*args, **kwargs)
1177
+            if len(update) == 1:
1178
+                update.append(True)
1179
+                raise neutron_client_exc.Forbidden()
1180
+
1181
+        with mock.patch.object(
1182
+                self.neutron, "update_port", side_effect=fake_update_port):
1183
+            # let's trigger a heal
1184
+            result = self.cli.heal_allocations(verbose=True, max_count=2)
1185
+
1186
+        self._assert_placement_not_updated(server)
1187
+        # Actually one of the ports was updated but the update was rolled
1188
+        # back when the second neutron port update failed
1189
+        self._assert_ports_not_updated(ports)
1190
+
1191
+        output = self.output.getvalue()
1192
+        self.assertIn(
1193
+            'Rolling back port update',
1194
+            output)
1195
+        self.assertIn(
1196
+            'Unable to update ports with allocations',
1197
+            output)
1198
+        self.assertEqual(6, result)
1199
+
1200
+    def test_heal_multiple_port_allocations_rollback_fails(self):
1201
+        """Test neutron port update rollback error case. Try to heal three
1202
+        ports and make the last port update fail in neutron. Also make the
1204
+        rollback of the first port update fail.
1204
+        """
1205
+        port2 = self.neutron.create_port()['port']
1206
+        port3 = self.neutron.create_port(port2)['port']
1207
+        server, _ = self._create_server_with_missing_port_alloc(
1208
+            [self.neutron.port_1, port2, port3])
1209
+
1210
+        orig_update_port = self.neutron.update_port
1211
+        port_updates = []
1212
+
1213
+        def fake_update_port(port_id, *args, **kwargs):
1214
+            # 0, 1: the first two update operations succeed
1215
+            # 4: the last rollback operation succeeds
1216
+            if len(port_updates) in [0, 1, 4]:
1217
+                port_updates.append(port_id)
1218
+                return orig_update_port(port_id, *args, **kwargs)
1219
+            # 2: last update operation fails
1220
+            # 3: the first rollback operation also fails
1221
+            if len(port_updates) in [2, 3]:
1222
+                port_updates.append(port_id)
1223
+                raise neutron_client_exc.Forbidden()
1224
+
1225
+        with mock.patch.object(
1226
+                self.neutron, "update_port",
1227
+                side_effect=fake_update_port) as mock_update_port:
1228
+            # let's trigger a heal
1229
+            result = self.cli.heal_allocations(verbose=True, max_count=2)
1230
+            self.assertEqual(5, mock_update_port.call_count)
1231
+
1232
+        self._assert_placement_not_updated(server)
1233
+
1234
+        # the order of the ports is random due to the usage of dicts so we
1236
+        # need the info from fake_update_port about which port update
1237
+        # failed
1237
+        # the first port update was successful, this will be the first port to
1238
+        # rollback too and the rollback will fail
1239
+        self._assert_port_updated(port_updates[0])
1240
+        # the second port update was successful, this will be the second port
1241
+        # to rollback which will succeed
1242
+        self._assert_port_not_updated(port_updates[1])
1243
+        # the third port was never updated successfully
1244
+        self._assert_port_not_updated(port_updates[2])
1245
+
1246
+        output = self.output.getvalue()
1247
+        self.assertIn(
1248
+            'Rolling back port update',
1249
+            output)
1250
+        self.assertIn(
1251
+            'Failed to update neutron ports with allocation keys and the '
1252
+            'automatic rollback of the previously successful port updates '
1253
+            'also failed',
1254
+            output)
1255
+        # as we failed to roll back the first port update we instruct the user
1256
+        # to clean it up manually
1257
+        self.assertIn(
1258
+            "Make sure that the binding:profile.allocation key of the "
1259
+            "affected ports ['%s'] are manually cleaned in neutron"
1260
+            % port_updates[0],
1261
+            output)
1262
+        self.assertEqual(7, result)
1263
+
1264
+    def _test_heal_port_allocation_placement_unavailable(
1265
+            self, server, ports, error):
1266
+
1267
+        with mock.patch('nova.cmd.manage.PlacementCommands.'
1268
+                        '_get_rps_in_tree_with_required_traits',
1269
+                        side_effect=error):
1270
+            result = self.cli.heal_allocations(verbose=True, max_count=2)
1271
+
1272
+        self._assert_placement_not_updated(server)
1273
+        self._assert_ports_not_updated(ports)
1274
+
1275
+        self.assertEqual(3, result)
1276
+
1277
+    def test_heal_port_allocation_placement_unavailable(self):
1278
+        server, ports = self._create_server_with_missing_port_alloc(
1279
+            [self.neutron.port_1])
1280
+
1281
+        for error in [
1282
+            exception.PlacementAPIConnectFailure(),
1283
+            exception.ResourceProviderRetrievalFailed(uuid=uuidsentinel.rp1),
1284
+            exception.ResourceProviderTraitRetrievalFailed(
1285
+                uuid=uuidsentinel.rp1)]:
1286
+
1287
+            self._test_heal_port_allocation_placement_unavailable(
1288
+                server, ports, error)
1289
+
1290
+
744 1291
 class TestNovaManagePlacementSyncAggregates(
745 1292
         integrated_helpers.ProviderUsageBaseTestCase):
746 1293
     """Functional tests for nova-manage placement sync_aggregates"""

+ 117
- 2
nova/tests/unit/test_nova_manage.py View File

@@ -2412,6 +2412,8 @@ class TestNovaManagePlacement(test.NoDBTestCase):
2412 2412
         self.output = StringIO()
2413 2413
         self.useFixture(fixtures.MonkeyPatch('sys.stdout', self.output))
2414 2414
         self.cli = manage.PlacementCommands()
2415
+        self.useFixture(
2416
+            fixtures.MockPatch('nova.network.neutronv2.api.get_client'))
2415 2417
 
2416 2418
     @ddt.data(-1, 0, "one")
2417 2419
     def test_heal_allocations_invalid_max_count(self, max_count):
@@ -2469,7 +2471,7 @@ class TestNovaManagePlacement(test.NoDBTestCase):
2469 2471
     @mock.patch('nova.objects.ComputeNode.get_by_host_and_nodename',
2470 2472
                 return_value=objects.ComputeNode(uuid=uuidsentinel.node))
2471 2473
     @mock.patch('nova.scheduler.utils.resources_from_flavor',
2472
-                return_value=mock.sentinel.resources)
2474
+                return_value={'VCPU': 1})
2473 2475
     @mock.patch('nova.scheduler.client.report.SchedulerReportClient.put',
2474 2476
                 return_value=fake_requests.FakeResponse(
2475 2477
                     500, content=jsonutils.dumps({"errors": [{"code": ""}]})))
@@ -2487,7 +2489,7 @@ class TestNovaManagePlacement(test.NoDBTestCase):
2487 2489
         expected_payload = {
2488 2490
             'allocations': {
2489 2491
                 uuidsentinel.node: {
2490
-                    'resources': mock.sentinel.resources
2492
+                    'resources': {'VCPU': 1}
2491 2493
                 }
2492 2494
             },
2493 2495
             'user_id': 'fake-user',
@@ -2789,6 +2791,119 @@ class TestNovaManagePlacement(test.NoDBTestCase):
2789 2791
                       self.output.getvalue())
2790 2792
         self.assertIn("Conflict!", self.output.getvalue())
2791 2793
 
2794
+    def test_has_request_but_no_allocation(self):
2795
+        # False because there is a full resource_request and allocation set.
2796
+        self.assertFalse(
2797
+            self.cli._has_request_but_no_allocation(
2798
+                {
2799
+                    'id': uuidsentinel.healed,
2800
+                    'resource_request': {
2801
+                        'resources': {
2802
+                            'NET_BW_EGR_KILOBIT_PER_SEC': 1000,
2803
+                        },
2804
+                        'required': [
2805
+                            'CUSTOM_VNIC_TYPE_NORMAL'
2806
+                        ]
2807
+                    },
2808
+                    'binding:profile': {'allocation': uuidsentinel.rp1}
2809
+                }))
2810
+        # True because there is a full resource_request but no allocation set.
2811
+        self.assertTrue(
2812
+            self.cli._has_request_but_no_allocation(
2813
+                {
2814
+                    'id': uuidsentinel.needs_healing,
2815
+                    'resource_request': {
2816
+                        'resources': {
2817
+                            'NET_BW_EGR_KILOBIT_PER_SEC': 1000,
2818
+                        },
2819
+                        'required': [
2820
+                            'CUSTOM_VNIC_TYPE_NORMAL'
2821
+                        ]
2822
+                    },
2823
+                    'binding:profile': {}
2824
+                }))
2825
+        # True because there is a full resource_request but no allocation set.
2826
+        self.assertTrue(
2827
+            self.cli._has_request_but_no_allocation(
2828
+                {
2829
+                    'id': uuidsentinel.needs_healing_null_profile,
2830
+                    'resource_request': {
2831
+                        'resources': {
2832
+                            'NET_BW_EGR_KILOBIT_PER_SEC': 1000,
2833
+                        },
2834
+                        'required': [
2835
+                            'CUSTOM_VNIC_TYPE_NORMAL'
2836
+                        ]
2837
+                    },
2838
+                    'binding:profile': None,
2839
+                }))
2840
+        # False because there are no resources in the resource_request.
2841
+        self.assertFalse(
2842
+            self.cli._has_request_but_no_allocation(
2843
+                {
2844
+                    'id': uuidsentinel.empty_resources,
2845
+                    'resource_request': {
2846
+                        'resources': {},
2847
+                        'required': [
2848
+                            'CUSTOM_VNIC_TYPE_NORMAL'
2849
+                        ]
2850
+                    },
2851
+                    'binding:profile': {}
2852
+                }))
2853
+        # False because there are no resources in the resource_request.
2854
+        self.assertFalse(
2855
+            self.cli._has_request_but_no_allocation(
2856
+                {
2857
+                    'id': uuidsentinel.missing_resources,
2858
+                    'resource_request': {
2859
+                        'required': [
2860
+                            'CUSTOM_VNIC_TYPE_NORMAL'
2861
+                        ]
2862
+                    },
2863
+                    'binding:profile': {}
2864
+                }))
2865
+        # False because there are no required traits in the resource_request.
2866
+        self.assertFalse(
2867
+            self.cli._has_request_but_no_allocation(
2868
+                {
2869
+                    'id': uuidsentinel.empty_required,
2870
+                    'resource_request': {
2871
+                        'resources': {
2872
+                            'NET_BW_EGR_KILOBIT_PER_SEC': 1000,
2873
+                        },
2874
+                        'required': []
2875
+                    },
2876
+                    'binding:profile': {}
2877
+                }))
2878
+        # False because there are no required traits in the resource_request.
2879
+        self.assertFalse(
2880
+            self.cli._has_request_but_no_allocation(
2881
+                {
2882
+                    'id': uuidsentinel.missing_required,
2883
+                    'resource_request': {
2884
+                        'resources': {
2885
+                            'NET_BW_EGR_KILOBIT_PER_SEC': 1000,
2886
+                        },
2887
+                    },
2888
+                    'binding:profile': {}
2889
+                }))
2890
+        # False because there are no resources or required traits in the
2891
+        # resource_request.
2892
+        self.assertFalse(
2893
+            self.cli._has_request_but_no_allocation(
2894
+                {
2895
+                    'id': uuidsentinel.empty_resource_request,
2896
+                    'resource_request': {},
2897
+                    'binding:profile': {}
2898
+                }))
2899
+        # False because there is no resource_request.
2900
+        self.assertFalse(
2901
+            self.cli._has_request_but_no_allocation(
2902
+                {
2903
+                    'id': uuidsentinel.missing_resource_request,
2904
+                    'binding:profile': {}
2905
+                }))
2906
+
2792 2907
 
2793 2908
 class TestNovaManageMain(test.NoDBTestCase):
2794 2909
     """Tests the nova-manage:main() setup code."""

+ 9
- 0
releasenotes/notes/nova-manage-heal-port-allocation-48cc1a34c92d42cd.yaml View File

@@ -0,0 +1,9 @@
1
+---
2
+other:
3
+  - |
4
+    The ``nova-manage placement heal_allocations`` `CLI`_ has been extended to
5
+    heal port allocations that may be missing due to `bug 1819923`_.
6
+
7
+
8
+    .. _bug 1819923: https://bugs.launchpad.net/nova/+bug/1819923
9
+    .. _CLI: https://docs.openstack.org/nova/latest/cli/nova-manage.html#placement
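
As a closing note for readers following the new ``test_has_request_but_no_allocation`` unit test above: the predicate it exercises can be summarised by the short sketch below. This is an illustrative reconstruction based solely on the True/False cases asserted in that test, written here as a standalone function for readability; it is not the exact helper merged into ``nova/cmd/manage.py``.

    def _has_request_but_no_allocation(port):
        """Return True if the neutron port requests resources and required
        traits but its binding:profile does not carry an allocation yet,
        i.e. the port still needs to be healed.
        """
        # Only ports that ask for both resources and required traits are
        # candidates for healing.
        request = port.get('resource_request') or {}
        # The binding:profile may be missing, None or empty; in all of these
        # cases no resource provider has been recorded for the port yet.
        binding_profile = port.get('binding:profile') or {}
        return (
            bool(request.get('resources')) and
            bool(request.get('required')) and
            'allocation' not in binding_profile)

Under these assumptions a port with an empty or absent resource_request, or one whose binding:profile already contains an ``allocation`` key, is skipped, which matches the False cases asserted in the unit test.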
