Merge "Fix to implement 'pack' or 'spread' VM's NUMA cells"
@ -95,14 +95,64 @@ In all cases where NUMA awareness is used, the ``NUMATopologyFilter``
filter must be enabled. Details on this filter are provided in
:doc:`/admin/scheduling`.

The host NUMA node(s) to use are chosen according to a strategy controlled
by the ``packing_host_numa_cells_allocation_strategy`` configuration option
in nova.conf. By default this option is set to ``True``, which means nova
attempts to choose the NUMA node(s) with the least free resources (in other
words, the **most used** NUMA nodes) first. This is the so-called "pack"
strategy: as much load as possible is placed on the most used host NUMA
node until it is completely exhausted, and only then is the most used node
among the remaining host NUMA nodes chosen. The "spread" strategy is the
reverse of "pack": the NUMA node(s) with the **most free** resources are
used first, so load is balanced across all NUMA nodes and the amount of
free resources on each node is kept as equal as possible.
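
For example, to opt into the "spread" behaviour an operator can set the
option in nova.conf (the option lives in the ``[compute]`` group, as the
tests in this change confirm):

.. code-block:: ini

   [compute]
   packing_host_numa_cells_allocation_strategy = False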

.. caution::

    The NUMA node(s) used are normally chosen at random. However, if a PCI
    passthrough or SR-IOV device is attached to the instance, then the NUMA
    node that the device is associated with will be used. This can provide
    important performance improvements. However, booting a large number of
    similar instances can result in unbalanced NUMA node usage. Care should
    be taken to mitigate this issue. See this `discussion`_ for more details.

The host's NUMA nodes are placed in a list, and the list is sorted based on
the chosen strategy and the resources available in each NUMA node. The
sorts are performed on the same list one after another, so the last sort
applied has the highest priority.

Python's sort is stable: a sort executed on the list changes the order of
items only where the property being sorted on differs, and items whose
properties compare equal keep their existing relative order. This is what
allows several sorts to be chained, each later one refining the result of
the earlier ones.
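
To make the chaining concrete, here is a minimal sketch (with made-up cell
data rather than nova objects) of how two successive stable sorts interact:
the last sort wins, and ties fall back to the order left by the earlier
sort.

.. code-block:: python

   cells = [
       {'id': 0, 'free_mem': 2048, 'free_pci': 0},
       {'id': 1, 'free_mem': 2048, 'free_pci': 2},
       {'id': 2, 'free_mem': 4096, 'free_pci': 0},
   ]
   # First (lowest-priority) sort: most free memory first ("spread").
   cells.sort(key=lambda c: c['free_mem'], reverse=True)
   # Last (highest-priority) sort: cells with more free PCI devices first.
   cells.sort(key=lambda c: c['free_pci'], reverse=True)
   print([c['id'] for c in cells])
   # [1, 2, 0]: cell 1 wins on PCI; cells 2 and 0 keep their memory order.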

Sorts are performed on the host's NUMA node list in the following order:

* sort based on available memory on the node (first sort, lowest priority)
* sort based on CPU usage (when the guest VM topology requests shared CPUs)
  or on free pinned CPUs otherwise
* sort based on the number of free PCI devices on the node (last sort,
  highest priority)

The top sorting priority goes to host NUMA nodes with PCI devices attached.
If the VM requested PCI device(s), the logic **always** puts the host NUMA
nodes with more PCI devices at the beginning of the list. If no PCI devices
were requested by the VM, the NUMA nodes with no (or fewer) PCI devices
available are placed at the beginning of the list, as the sketch below
illustrates.
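
A minimal sketch of this PCI-priority sort, again with made-up data:

.. code-block:: python

   pci_per_cell = {1: 2}   # NUMA cell id -> number of free PCI devices
   cells = [0, 1, 2, 3]    # host NUMA cell ids, already sorted by RAM/CPU
   pci_requested = True    # whether the VM asked for a PCI device
   cells.sort(key=lambda c: pci_per_cell.get(c, 0), reverse=pci_requested)
   print(cells)
   # [1, 0, 2, 3] -- with pci_requested = False it would be [0, 2, 3, 1]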

.. caution::

   The described logic for PCI devices is used **both** for the "pack" and
   the "spread" strategies. This is done to keep backward compatibility
   with previous nova versions.

With the "pack" strategy, the remaining two sorts are performed with a sort
order that moves the NUMA nodes with more available resources (CPUs and
memory) to the END of the host NUMA node list. The memory-based sort is
applied first and has the lowest priority.

With the "spread" strategy, the remaining two sorts are performed with a
sort order that moves the NUMA nodes with more available resources (CPUs
and memory) to the BEGINNING of the host NUMA node list. Again, the
memory-based sort is applied first and has the lowest priority.

Finally, the resulting list (after all sorts) is passed on, and attempts to
place the VM's NUMA cells onto host NUMA nodes are made starting from the
first host NUMA node in the list.
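
The effect of the ``pack`` flag on such a sort pass can be sketched as
follows (toy numbers; in the nova code the memory sort key is the cell's
``avail_memory`` and the flag maps to ``reverse=not pack``):

.. code-block:: python

   cells = [
       {'id': 0, 'avail_memory': 1024},
       {'id': 1, 'avail_memory': 3072},
       {'id': 2, 'avail_memory': 2048},
   ]
   for pack in (True, False):
       ordered = sorted(
           cells, reverse=not pack, key=lambda c: c['avail_memory'])
       print(pack, [c['id'] for c in ordered])
   # True  [0, 2, 1] -- "pack": most used (least free) cells first
   # False [1, 2, 0] -- "spread": most free cells first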

.. caution::

@ -724,5 +774,4 @@ instances with a NUMA topology.

.. Links
.. _`Image metadata`: https://docs.openstack.org/image-guide/introduction.html#image-metadata
.. _`discussion`: http://lists.openstack.org/pipermail/openstack-dev/2016-March/090367.html
.. _`MTTCG project`: http://wiki.qemu.org/Features/tcg-multithread

@ -1007,6 +1007,23 @@ Related options:
* ``[scheduler]query_placement_for_image_type_support`` - enables
  filtering computes based on supported image types, which is required
  to be enabled for this to take effect.
"""),
    cfg.BoolOpt('packing_host_numa_cells_allocation_strategy',
        default=True,
        help="""
This option controls the allocation strategy used to choose the host NUMA
cells on which to place a VM's NUMA cells (for VMs with a defined NUMA
topology). By default, the host NUMA cell with the most resources already
consumed is tried first for placement, so a host cell that already has some
usage is packed with the VM's cells until it is completely exhausted,
before a new, free host cell is used. When the
packing_host_numa_cells_allocation_strategy option is set to ``False``, the
host NUMA cell with the most resources available will be used first.

Possible values:

* ``True``: Pack the VM's NUMA cells onto the most used host NUMA cells.
* ``False``: Spread the VM's NUMA cells across the host NUMA cells with the
  most resources available.
"""),
]

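A quick way to check which strategy is in effect (a minimal sketch; the
option group and name are taken from this change, and the default is
returned when nothing overrides it):

.. code-block:: python

   import nova.conf

   CONF = nova.conf.CONF
   # True -> "pack" (the default), False -> "spread"
   print(CONF.compute.packing_host_numa_cells_allocation_strategy)
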
@ -19,6 +19,7 @@ import ddt
import mock
import testtools

import nova.conf
from nova import exception
from nova import objects
from nova.objects import fields
@ -28,6 +29,8 @@ from nova.tests.unit import fake_pci_device_pools as fake_pci
from nova.tests.unit.image.fake import fake_image_obj
from nova.virt import hardware as hw

CONF = nova.conf.CONF


class InstanceInfoTests(test.NoDBTestCase):

@ -2753,7 +2756,7 @@ class VirtNUMAHostTopologyTestCase(test.NoDBTestCase):
        # the PCI device is found on host cell 1
        pci_stats = _create_pci_stats(1)

        # ...threfore an instance without a PCI device should get host cell 2
        # ...therefore an instance without a PCI device should get host cell 2
        instance_topology = hw.numa_fit_instance_to_host(
            self.host, self.instance1, pci_stats=pci_stats)
        self.assertIsInstance(instance_topology, objects.InstanceNUMATopology)

@ -2763,7 +2766,7 @@ class VirtNUMAHostTopologyTestCase(test.NoDBTestCase):
        # the PCI device is now found on host cell 2
        pci_stats = _create_pci_stats(2)

        # ...threfore an instance without a PCI device should get host cell 1
        # ...therefore an instance without a PCI device should get host cell 1
        instance_topology = hw.numa_fit_instance_to_host(
            self.host, self.instance1, pci_stats=pci_stats)
        self.assertIsInstance(instance_topology, objects.InstanceNUMATopology)

@ -5664,3 +5667,243 @@ class RescuePropertyTestCase(test.NoDBTestCase):
        meta = objects.ImageMeta.from_dict({'disk_format': 'raw'})
        meta.properties = objects.ImageMetaProps.from_dict(props)
        self.assertEqual(expected, hw.check_hw_rescue_props(meta))


class HostCellsSortingTestCase(test.NoDBTestCase):
    # NOTE (IPO) It is possible to test all sorting cases with one defined
    # host NUMA topology.
    # We have 4 NUMA cells with the following properties:
    # NUMA cell 0: has the most CPU usage
    # NUMA cell 1: will have the most PCI devices available
    # NUMA cell 2: has the most free pCPUs
    # NUMA cell 3: has the most available memory
    # So it is enough to check the order of NUMA cells in the resulting
    # instance topology to verify each particular sorting case.

    def setUp(self):
        super(HostCellsSortingTestCase, self).setUp()

        def _create_pci_stats(node, count):
            test_dict = copy.copy(fake_pci.fake_pool_dict)
            test_dict['numa_node'] = node
            test_dict['vendor_id'] = '8086'
            test_dict['product_id'] = 'fake-prod0'
            test_dict['count'] = count
            return stats.PciDeviceStats(
                objects.NUMATopology(),
                [objects.PciDevicePool.from_dict(test_dict)])

        self.pci_stats = _create_pci_stats(1, 2)

        self.host = objects.NUMATopology(cells=[
            objects.NUMACell(
                id=0,
                cpuset=set([1, 2, 3, 4]),
                pcpuset=set([1, 2, 3, 4]),
                memory=4096,
                cpu_usage=3,
                memory_usage=2048,
                pinned_cpus=set([1, 2]),
                mempages=[objects.NUMAPagesTopology(
                    size_kb=4, total=524288, used=0)],
                siblings=[set([1]), set([2]), set([3]), set([4])]),
            objects.NUMACell(
                id=1,
                cpuset=set([5, 6, 7, 8]),
                pcpuset=set([5, 6, 7, 8]),
                memory=4096,
                cpu_usage=2,
                memory_usage=2048,
                pinned_cpus=set([5, 6]),
                mempages=[objects.NUMAPagesTopology(
                    size_kb=4, total=524288, used=0)],
                siblings=[set([5]), set([6]), set([7]), set([8])]),
            objects.NUMACell(
                id=2,
                cpuset=set([9, 10, 11, 12]),
                pcpuset=set([9, 10, 11, 12]),
                memory=4096,
                cpu_usage=2,
                memory_usage=2048,
                pinned_cpus=set(),
                mempages=[objects.NUMAPagesTopology(
                    size_kb=4, total=524288, used=0)],
                siblings=[set([9]), set([10]), set([11]), set([12])]),
            objects.NUMACell(
                id=3,
                cpuset=set([13, 14, 15, 16]),
                pcpuset=set([13, 14, 15, 16]),
                memory=4096,
                cpu_usage=2,
                memory_usage=1024,
                pinned_cpus=set([13, 14]),
                mempages=[objects.NUMAPagesTopology(
                    size_kb=4, total=524288, used=0)],
                siblings=[set([13]), set([14]), set([15]), set([16])])
        ])

        self.instance0 = objects.InstanceNUMATopology(cells=[
            objects.InstanceNUMACell(
                id=0, cpuset=set([0]), pcpuset=set(), memory=2048),
            objects.InstanceNUMACell(
                id=1, cpuset=set([1]), pcpuset=set(), memory=2048),
            objects.InstanceNUMACell(
                id=2, cpuset=set([2]), pcpuset=set(), memory=2048),
            objects.InstanceNUMACell(
                id=3, cpuset=set([3]), pcpuset=set(), memory=2048)
        ])

        self.instance1 = objects.InstanceNUMATopology(cells=[
            objects.InstanceNUMACell(
                id=0, cpuset=set([0]), pcpuset=set(), memory=2048),
            objects.InstanceNUMACell(
                id=1, cpuset=set([1]), pcpuset=set(), memory=2048),
            objects.InstanceNUMACell(
                id=2, cpuset=set([2]), pcpuset=set(), memory=2048),
        ])

        self.instance2 = objects.InstanceNUMATopology(cells=[
            objects.InstanceNUMACell(
                id=0,
                cpuset=set(),
                pcpuset=set([0]),
                memory=2048,
                cpu_policy=fields.CPUAllocationPolicy.DEDICATED
            ),
            objects.InstanceNUMACell(
                id=1,
                cpuset=set(),
                pcpuset=set([1]),
                memory=2048,
                cpu_policy=fields.CPUAllocationPolicy.DEDICATED
            ),
            objects.InstanceNUMACell(
                id=2,
                cpuset=set(),
                pcpuset=set([2]),
                memory=2048,
                cpu_policy=fields.CPUAllocationPolicy.DEDICATED
            )])

    def assertInstanceNUMAcellOrder(self, list_to_check, instance_topo):
        for cell, cell_id in zip(instance_topo.cells, list_to_check):
            self.assertEqual(cell.id, cell_id)

    def test_sort_host_numa_cell_num_equal_instance_cell_num(self):
        instance_topology = hw.numa_fit_instance_to_host(
            self.host, self.instance0)
        self.assertInstanceNUMAcellOrder([0, 1, 2, 3], instance_topology)

    def test_sort_no_pci_stats_no_shared_cpu_policy(self):
        CONF.set_override(
            'packing_host_numa_cells_allocation_strategy',
            True,
            group='compute')
        instance_topology = hw.numa_fit_instance_to_host(
            self.host, self.instance2)
        self.assertInstanceNUMAcellOrder([0, 1, 3], instance_topology)
        CONF.set_override(
            'packing_host_numa_cells_allocation_strategy',
            False,
            group='compute')
        instance_topology = hw.numa_fit_instance_to_host(
            self.host, self.instance2)
        self.assertInstanceNUMAcellOrder([2, 3, 0], instance_topology)

    def test_sort_no_pci_stats_shared_cpu_policy(self):
        CONF.set_override(
            'packing_host_numa_cells_allocation_strategy',
            True,
            group='compute')
        instance_topology = hw.numa_fit_instance_to_host(
            self.host, self.instance1)
        self.assertInstanceNUMAcellOrder([0, 1, 2], instance_topology)
        CONF.set_override(
            'packing_host_numa_cells_allocation_strategy',
            False,
            group='compute')
        instance_topology = hw.numa_fit_instance_to_host(
            self.host, self.instance1)
        self.assertInstanceNUMAcellOrder([3, 1, 2], instance_topology)

    def test_sort_pci_stats_pci_req_no_shared_cpu_policy(self):
        CONF.set_override(
            'packing_host_numa_cells_allocation_strategy',
            True,
            group='compute')
        pci_request = objects.InstancePCIRequest(count=1,
            spec=[{'vendor_id': '8086', 'product_id': 'fake-prod0'}])
        pci_reqs = [pci_request]
        instance_topology = hw.numa_fit_instance_to_host(
            self.host, self.instance2,
            pci_requests=pci_reqs,
            pci_stats=self.pci_stats)
        self.assertInstanceNUMAcellOrder([1, 0, 3], instance_topology)
        CONF.set_override(
            'packing_host_numa_cells_allocation_strategy',
            False,
            group='compute')
        instance_topology = hw.numa_fit_instance_to_host(
            self.host, self.instance2,
            pci_requests=pci_reqs,
            pci_stats=self.pci_stats)
        self.assertInstanceNUMAcellOrder([1, 2, 3], instance_topology)

    def test_sort_pci_stats_pci_req_shared_cpu_policy(self):
        CONF.set_override(
            'packing_host_numa_cells_allocation_strategy',
            True,
            group='compute')
        pci_request = objects.InstancePCIRequest(count=1,
            spec=[{'vendor_id': '8086', 'product_id': 'fake-prod0'}])
        pci_reqs = [pci_request]
        instance_topology = hw.numa_fit_instance_to_host(
            self.host, self.instance1,
            pci_requests=pci_reqs,
            pci_stats=self.pci_stats)
        self.assertInstanceNUMAcellOrder([1, 0, 2], instance_topology)
        CONF.set_override(
            'packing_host_numa_cells_allocation_strategy',
            False,
            group='compute')
        instance_topology = hw.numa_fit_instance_to_host(
            self.host, self.instance1,
            pci_requests=pci_reqs,
            pci_stats=self.pci_stats)
        self.assertInstanceNUMAcellOrder([1, 3, 2], instance_topology)

    def test_sort_pci_stats_no_pci_req_no_shared_cpu_policy(self):
        CONF.set_override(
            'packing_host_numa_cells_allocation_strategy',
            True,
            group='compute')
        instance_topology = hw.numa_fit_instance_to_host(
            self.host, self.instance2,
            pci_stats=self.pci_stats)
        self.assertInstanceNUMAcellOrder([0, 3, 2], instance_topology)
        CONF.set_override(
            'packing_host_numa_cells_allocation_strategy',
            False,
            group='compute')
        instance_topology = hw.numa_fit_instance_to_host(
            self.host, self.instance2,
            pci_stats=self.pci_stats)
        self.assertInstanceNUMAcellOrder([2, 3, 0], instance_topology)

    def test_sort_pci_stats_no_pci_req_shared_cpu_policy(self):
        CONF.set_override(
            'packing_host_numa_cells_allocation_strategy',
            True,
            group='compute')
        instance_topology = hw.numa_fit_instance_to_host(
            self.host, self.instance1,
            pci_stats=self.pci_stats)
        self.assertInstanceNUMAcellOrder([0, 2, 3], instance_topology)
        CONF.set_override(
            'packing_host_numa_cells_allocation_strategy',
            False,
            group='compute')
        instance_topology = hw.numa_fit_instance_to_host(
            self.host, self.instance1,
            pci_stats=self.pci_stats)
        self.assertInstanceNUMAcellOrder([3, 2, 0], instance_topology)

@ -2295,14 +2295,67 @@ def numa_fit_instance_to_host(

    host_cells = host_topology.cells

    # If PCI device(s) are not required, prefer host cells that don't have
    # devices attached. Presence of a given numa_node in a PCI pool is
    # indicative of a PCI device being associated with that node.
    if not pci_requests and pci_stats:
        # TODO(stephenfin): pci_stats can't be None here but mypy can't figure
        # that out for some reason
        host_cells = sorted(host_cells, key=lambda cell: cell.id in [
            pool['numa_node'] for pool in pci_stats.pools])  # type: ignore

    # We only need to perform the ordering optimizations if the number of
    # instance cells is less than the number of host cells. If they are
    # equal, all host cells will be used anyway and no sorting of the cell
    # list is needed.
    if len(host_topology) > len(instance_topology):
        pack = CONF.compute.packing_host_numa_cells_allocation_strategy
        # To balance NUMA cell usage based on several parameters, a few
        # sorts are performed on the host_cells list to move the less used
        # cells to the beginning of the list (when pack is set to 'False').
        # When pack is set to 'True', the most used cells are put at the
        # beginning of the list.

        # The first sort is based on memory usage. cell.avail_memory returns
        # the free memory of the cell. The sort order is reversed to get the
        # cells with more free memory first when pack is 'False'.
        host_cells = sorted(
            host_cells,
            reverse=not pack,
            key=lambda cell: cell.avail_memory)

        # The next sort is based on available dedicated or shared CPUs.
        # cpu_policy is set to the same value in all cells, so the first
        # cell in the list (which exists if instance_topology is defined)
        # is used to get the cpu_policy.
        if instance_topology.cells[0].cpu_policy in (
                None, fields.CPUAllocationPolicy.SHARED):
            # sort based on used CPUs
            host_cells = sorted(
                host_cells,
                reverse=pack,
                key=lambda cell: cell.cpu_usage)
        else:
            # sort based on the number of free pinned CPUs
            host_cells = sorted(
                host_cells,
                reverse=not pack,
                key=lambda cell: len(cell.free_pcpus))

        # Perform this sort only if pci_stats exists.
        if pci_stats:
            # Create a dict with the NUMA cell id as key and the total
            # number of free PCI devices as value.
            total_pci_in_cell: ty.Dict[int, int] = {}
            for pool in pci_stats.pools:
                if pool['numa_node'] in total_pci_in_cell:
                    total_pci_in_cell[pool['numa_node']] += pool['count']
                else:
                    total_pci_in_cell[pool['numa_node']] = pool['count']
            # For backward compatibility we always 'spread' here: host
            # cells with PCI devices are moved to the beginning of the
            # list if the VM requested PCI devices, and to the end of the
            # list if it did not.
            if pci_requests:
                host_cells = sorted(
                    host_cells,
                    reverse=True,
                    key=lambda cell: total_pci_in_cell.get(cell.id, 0))
            else:
                host_cells = sorted(
                    host_cells,
                    key=lambda cell: total_pci_in_cell.get(cell.id, 0))

    for host_cell_perm in itertools.permutations(
            host_cells, len(instance_topology)):
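        # NOTE: itertools.permutations yields tuples in the order of its
        # input, so this loop tries candidate placements built from the
        # front of the sorted host_cells list first; e.g. for a pre-sorted
        # list ['cell2', 'cell3', 'cell0'] the first 2-cell candidate
        # would be ('cell2', 'cell3').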
@ -0,0 +1,10 @@
---
features:
  - |
    Extra sorting steps were added to the ``numa_fit_instance_to_host``
    function to balance the usage of the hypervisor's NUMA cells. The
    hypervisor's NUMA cells with more free resources (CPU, RAM, and PCI
    devices if requested) will be used first (the "spread" strategy) when
    the ``packing_host_numa_cells_allocation_strategy`` configuration
    option is set to ``False``. The default value of the
    ``packing_host_numa_cells_allocation_strategy`` option is ``True``,
    which selects the "pack" strategy.