scheduler: Start utilizing RequestSpec.network_metadata

Now that we have this information, we can use it to pre-filter
suitable hosts.

With this patch we complete the blueprint. As a result, documentation
and release notes are bundled in the patch and previously inactive tests
are now enabled.

Part of blueprint numa-aware-vswitches

Change-Id: Ide262733ffd7714fdc702b31c61bdd42dbf7acc3
Stephen Finucane 2018-06-14 17:04:48 +01:00 committed by Matt Riedemann
parent 7aa9d1c23b
commit 803f85d7e6
10 changed files with 298 additions and 14 deletions


@@ -31,6 +31,14 @@ Simultaneous Multi-Threading (SMT)
CPUs on the system and can execute workloads in parallel. However, as with
NUMA, threads compete for shared resources.
Non Uniform I/O Access (NUMA I/O)
In a NUMA system, I/O to a device mapped to a local memory region is more
efficient than I/O to a remote device. A device connected to the same socket
providing the CPU and memory offers lower latencies for I/O operations due to
its physical proximity. This generally manifests itself in devices connected
to the PCIe bus, such as NICs or vGPUs, but applies to any device supporting
memory-mapped I/O.
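The NUMA node of a PCIe device is visible via *sysfs*, so this locality can be
inspected directly. The short Python sketch below is an illustrative addition
(not part of this patch); it lists each PCI device together with the NUMA node
it reports, where ``-1`` means the platform exposes no locality information for
that device.
.. code-block:: python
   import glob
   import os
   # Illustrative only: list every PCI device and the NUMA node it reports
   # via sysfs. A value of -1 means no locality information is exposed.
   for dev_path in sorted(glob.glob('/sys/bus/pci/devices/*')):
       try:
           with open(os.path.join(dev_path, 'numa_node')) as f:
               node = f.read().strip()
       except IOError:
           node = 'unknown'
       print('%s -> NUMA node %s' % (os.path.basename(dev_path), node))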
In OpenStack, SMP CPUs are known as *cores*, NUMA cells or nodes are known as
*sockets*, and SMT CPUs are known as *threads*. For example, a quad-socket,
eight core system with Hyper-Threading would have four sockets, eight cores per


@@ -31,6 +31,7 @@ operating system, and exposes functionality over a web-based API.
manage-volumes.rst
migration.rst
networking-nova.rst
networking.rst
node-down.rst
pci-passthrough.rst
quotas2.rst


@@ -0,0 +1,184 @@
=======================
Networking with neutron
=======================
While nova uses the :neutron-doc:`OpenStack Networking service (neutron) <>` to
provide network connectivity for instances, nova itself provides some
additional features not possible with neutron alone. These are described below.
SR-IOV
------
.. versionchanged:: 2014.2
The feature described below was first introduced in the Juno release.
The SR-IOV specification defines a standardized mechanism to virtualize PCIe
devices. This mechanism can virtualize a single PCIe Ethernet controller to
appear as multiple PCIe devices. Each device can be directly assigned to an
instance, bypassing the hypervisor and virtual switch layer. As a result, users
are able to achieve low latency and near line-rate speeds.
A full guide on configuring and using SR-IOV is provided in the
:neutron-doc:`OpenStack Networking service documentation
<admin/config-sriov.html>`.
NUMA Affinity
-------------
.. versionadded:: 18.0.0
The feature described below was first introduced in the Rocky release.
.. important::
The functionality described below is currently only supported by the
libvirt/KVM driver.
As described in :doc:`cpu-topologies`, NUMA is a computer architecture where
memory accesses to certain regions of system memory can have higher latencies
than other regions, depending on the CPU(s) your process is running on. This
effect extends to devices connected to the PCIe bus, a concept known as NUMA
I/O. Many Network Interface Cards (NICs) connect using the PCIe interface,
meaning they are susceptible to the ill-effects of poor NUMA affinitization. As
a result, NUMA locality must be considered when creating an instance where high
dataplane performance is a requirement.
Fortunately, nova provides functionality to ensure NUMA affinitization for
instances using neutron. How this works depends on the type of
port you are trying to use.
.. todo::
Add documentation for PCI NUMA affinity and PCI policies and link to it from
here.
For SR-IOV ports, virtual functions, which are PCI devices, are attached to the
instance. This means the instance can benefit from the NUMA affinity guarantees
provided for PCI devices. This happens automatically.
For all other types of ports, some manual configuration is required.
#. Identify the type of network(s) you wish to provide NUMA affinity for.
- If a network is an L2-type network (``provider:network_type`` of ``flat``
or ``vlan``), affinity of the network to given NUMA node(s) can vary
depending on the value of the ``provider:physical_network`` attribute of the
network, commonly referred to as the *physnet* of the network. This is
because most neutron drivers map each *physnet* to a different bridge, to
which multiple NICs are attached, or to a different (logical) NIC.
- If a network is an L3-type network (``provider:network_type`` of
``vxlan``, ``gre`` or ``geneve``), all traffic will use the device to
which the *endpoint IP* is assigned. This means all L3 networks on a given
host will have affinity to the same NUMA node(s). Refer to
:neutron-doc:`the neutron documentation
<admin/intro-overlay-protocols.html>` for more information.
#. Determine the NUMA affinity of the NICs attached to the given network(s).
How this should be achieved varies depending on the switching solution used
and whether the network is an L2-type network or an L3-type network.
Consider an L2-type network using the Linux Bridge mechanism driver. As
noted in the :neutron-doc:`neutron documentation
<admin/deploy-lb-selfservice.html>`, *physnets* are mapped to interfaces
using the ``[linux_bridge] physical_interface_mappings`` configuration
option. For example:
.. code-block:: ini
[linux_bridge]
physical_interface_mappings = provider:PROVIDER_INTERFACE
Once you have the device name, you can query *sysfs* to retrieve the NUMA
affinity for this device. For example:
.. code-block:: shell
$ cat /sys/class/net/PROVIDER_INTERFACE/device/numa_node
For an L3-type network using the Linux Bridge mechanism driver, the device
used will be configured using a protocol-specific endpoint IP configuration
option. For VXLAN, this is the ``[vxlan] local_ip`` option. For example:
.. code-block:: ini
[vxlan]
local_ip = OVERLAY_INTERFACE_IP_ADDRESS
Once you have the IP address in question, you can use :command:`ip` to
identify the device that has been assigned this IP address and from there
query the NUMA affinity using *sysfs* as above; a Python sketch automating
this lookup is included after this list.
.. note::
The example provided above is merely that: an example. How one should
identify this information can vary massively depending on the driver
used, whether bonding is used, the type of network used, etc.
#. Configure NUMA affinity in ``nova.conf``.
Once you have identified the NUMA affinity of the devices used for your
networks, you need to configure this in ``nova.conf``. As before, how this
should be achieved varies depending on the type of network.
For L2-type networks, NUMA affinity is defined based on the
``provider:physical_network`` attribute of the network. There are two
configuration options that must be set:
``[neutron] physnets``
This should be set to the list of physnets for which you wish to provide
NUMA affinity. Refer to the :oslo.config:option:`documentation
<neutron.physnets>` for more information.
``[neutron_physnet_{physnet}] numa_nodes``
This should be set to the list of NUMA node(s) that networks with the
given ``{physnet}`` should be affinitized to.
For L3-type networks, NUMA affinity is defined globally for all tunneled
networks on a given host. There is only one configuration option that must
be set:
``[neutron_tunneled] numa_nodes``
This should be set to a list of one or more NUMA nodes to which instances using
tunneled networks will be affinitized.
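The following sketch, referenced from step 2 above, is illustrative only: it is
not part of nova and assumes the third-party ``psutil`` package is installed.
It maps a configured endpoint IP (for example the ``[vxlan] local_ip`` value)
to its interface and reads that interface's NUMA node from *sysfs*; the IP
address used here is a placeholder.
.. code-block:: python
   import psutil
   def numa_node_for_ip(ip_address):
       """Return (interface, NUMA node) for the NIC carrying ip_address."""
       for ifname, addresses in psutil.net_if_addrs().items():
           if any(addr.address == ip_address for addr in addresses):
               path = '/sys/class/net/%s/device/numa_node' % ifname
               try:
                   with open(path) as f:
                       return ifname, int(f.read().strip())
               except IOError:
                   # bridges, bonds, etc. have no PCI parent device
                   return ifname, None
       return None, None
   print(numa_node_for_ip('203.0.113.10'))  # the [vxlan] local_ip value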
Examples
~~~~~~~~
Take an example of a deployment using L2-type networks first.
.. code-block:: ini
[neutron]
physnets = foo,bar
[neutron_physnet_foo]
numa_nodes = 0
[neutron_physnet_bar]
numa_nodes = 2, 3
This configuration will ensure instances using one or more L2-type networks
with ``provider:physical_network=foo`` must be scheduled on host cores from
NUMA node 0, while instances using one or more networks with
``provider:physical_network=bar`` must be scheduled on host cores from both
NUMA nodes 2 and 3. For the latter case, it will be necessary to split the
guest across two or more host NUMA nodes using the ``hw:numa_nodes``
:ref:`flavor extra spec <extra-specs-numa-topology>`.
Now, take an example of a deployment using L3-type networks.
.. code-block:: ini
[neutron_tunneled]
numa_nodes = 0
This is much simpler as all tunneled traffic uses the same logical interface.
As with the L2-type networks, this configuration will ensure instances using
one or more L3-type networks must be scheduled on host cores from NUMA node 0.
It is also possible to define more than one NUMA node, in which case the
instance must be split across these nodes.


@@ -818,8 +818,9 @@ class API(base.Base):
# InstancePCIRequests object
pci_request_info = pci_request.get_pci_requests_from_flavor(
instance_type)
self.network_api.create_resource_requests(
context, requested_networks, pci_request_info)
network_metadata = self.network_api.create_resource_requests(context,
requested_networks, pci_request_info)
base_options = {
'reservation_id': reservation_id,
@@ -859,13 +860,15 @@ class API(base.Base):
# return the validated options and maximum number of instances allowed
# by the network quotas
return base_options, max_network_count, key_pair, security_groups
return (base_options, max_network_count, key_pair, security_groups,
network_metadata)
def _provision_instances(self, context, instance_type, min_count,
max_count, base_options, boot_meta, security_groups,
block_device_mapping, shutdown_terminate,
instance_group, check_server_group_quota, filter_properties,
key_pair, tags, trusted_certs, supports_multiattach=False):
key_pair, tags, trusted_certs, supports_multiattach=False,
network_metadata=None):
# Check quotas
num_instances = compute_utils.check_num_instances_quota(
context, instance_type, min_count, max_count)
@@ -901,6 +904,11 @@
req_spec.num_instances = num_instances
req_spec.create()
# NOTE(stephenfin): The network_metadata field is not persisted
# and is therefore set after 'create' is called.
if network_metadata:
req_spec.network_metadata = network_metadata
# Create an instance object, but do not store in db yet.
instance = objects.Instance(context=context)
instance.uuid = instance_uuid
@@ -1148,8 +1156,8 @@
self._check_auto_disk_config(image=boot_meta,
auto_disk_config=auto_disk_config)
base_options, max_net_count, key_pair, security_groups = \
self._validate_and_build_base_options(
base_options, max_net_count, key_pair, security_groups, \
network_metadata = self._validate_and_build_base_options(
context, instance_type, boot_meta, image_href, image_id,
kernel_id, ramdisk_id, display_name, display_description,
key_name, key_data, security_groups, availability_zone,
@@ -1189,7 +1197,7 @@
boot_meta, security_groups, block_device_mapping,
shutdown_terminate, instance_group, check_server_group_quota,
filter_properties, key_pair, tags, trusted_certs,
supports_multiattach)
supports_multiattach, network_metadata)
instances = []
request_specs = []
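The NOTE in the hunk above explains the ordering: ``network_metadata`` is not a
persisted field, so it is attached to the request spec only after ``create()``
has written the persisted data. The toy sketch below uses stand-in classes (it
is not nova code) to show that pattern.
.. code-block:: python
   # Toy stand-ins, not nova code: a transient, scheduler-only attribute is
   # attached only after create() has saved the persisted fields.
   class FakeRequestSpec(object):
       _persisted_fields = ('num_instances',)
       def __init__(self, num_instances):
           self.num_instances = num_instances
       def create(self):
           # pretend to write only the persisted fields to the database
           print({f: getattr(self, f) for f in self._persisted_fields})
   def provision(network_metadata=None):
       spec = FakeRequestSpec(num_instances=2)
       spec.create()
       if network_metadata:
           spec.network_metadata = network_metadata  # never hits the database
       return spec
   spec = provision(network_metadata={'physnets': {'foo'}, 'tunneled': False})
   print(getattr(spec, 'network_metadata', None))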


@@ -76,6 +76,10 @@ class NUMATopologyFilter(filters.BaseHostFilter):
host_state)
pci_requests = spec_obj.pci_requests
network_metadata = None
if 'network_metadata' in spec_obj:
network_metadata = spec_obj.network_metadata
if pci_requests:
pci_requests = pci_requests.requests
@@ -87,6 +91,10 @@ class NUMATopologyFilter(filters.BaseHostFilter):
limits = objects.NUMATopologyLimits(
cpu_allocation_ratio=cpu_ratio,
ram_allocation_ratio=ram_ratio)
if network_metadata:
limits.network_metadata = network_metadata
instance_topology = (hardware.numa_fit_instance_to_host(
host_topology, requested_topology,
limits=limits,
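The filter above only forwards the requested metadata into the
``NUMATopologyLimits`` object; the actual per-cell matching happens inside
``hardware.numa_fit_instance_to_host``. The standalone sketch below is a
simplified assumption of that check, not the real ``numa_fit`` code, but it
reproduces the behaviour exercised by the new unit tests further down: every
requested physnet, and tunneled traffic if requested, must be affine to at
least one of the host NUMA cells chosen for the guest.
.. code-block:: python
   # Simplified, assumed logic -- not the real numa_fit code.
   def cells_support_network_metadata(chosen_cells, requested):
       """True if the chosen host cells can satisfy the requested networks."""
       missing_physnets = set(requested['physnets'])
       tunneled_ok = not requested['tunneled']
       for cell in chosen_cells:
           missing_physnets -= cell['physnets']
           tunneled_ok = tunneled_ok or cell['tunneled']
       return not missing_physnets and tunneled_ok
   cell_a = {'physnets': {'foo', 'bar'}, 'tunneled': False}
   cell_b = {'physnets': set(), 'tunneled': True}
   # A guest spanning both cells can get 'foo' from cell A and tunneled
   # traffic from cell B, so this passes.
   print(cells_support_network_metadata(
       [cell_a, cell_b], {'physnets': {'foo'}, 'tunneled': True}))  # True
   # A single-cell guest on cell A cannot also be affine to tunneled
   # traffic, so this fails.
   print(cells_support_network_metadata(
       [cell_a], {'physnets': {'foo'}, 'tunneled': True}))  # False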


@@ -326,9 +326,7 @@ class NUMAServersWithNetworksTest(NUMAServersTestBase):
flavor_id, networks)
self.assertTrue(filter_mock.called)
# TODO(stephenfin): Switch this to 'ERROR' once the final patch is
# merged
self.assertEqual('ACTIVE', status)
self.assertEqual('ERROR', status)
def test_create_server_with_physnet_and_tunneled_net(self):
"""Test combination of physnet and tunneled network.


@@ -293,7 +293,7 @@ class _ComputeAPIUnitTestMixIn(object):
mock.patch.object(self.compute_api,
'_validate_and_build_base_options',
return_value=({}, max_net_count, None,
['default']))
['default'], None))
) as (
get_image,
check_auto_disk_config,
@@ -6076,7 +6076,8 @@ class ComputeAPIUnitTestCase(_ComputeAPIUnitTestMixIn, test.NoDBTestCase):
with mock.patch.object(
self.compute_api.security_group_api, 'get',
return_value={'id': uuids.secgroup_uuid}) as scget:
base_options, max_network_count, key_pair, security_groups = (
base_options, max_network_count, key_pair, security_groups, \
network_metadata = (
self.compute_api._validate_and_build_base_options(
self.context, instance_type, boot_meta, uuids.image_href,
mock.sentinel.image_id, kernel_id, ramdisk_id,


@@ -628,7 +628,7 @@ class CellsConductorAPIRPCRedirect(test.NoDBTestCase):
_validate, _get_image, _check_bdm,
_provision, _record_action_start):
_get_image.return_value = (None, 'fake-image')
_validate.return_value = ({}, 1, None, ['default'])
_validate.return_value = ({}, 1, None, ['default'], None)
_check_bdm.return_value = objects.BlockDeviceMappingList()
_provision.return_value = []


@@ -28,13 +28,18 @@ class TestNUMATopologyFilter(test.NoDBTestCase):
super(TestNUMATopologyFilter, self).setUp()
self.filt_cls = numa_topology_filter.NUMATopologyFilter()
def _get_spec_obj(self, numa_topology):
def _get_spec_obj(self, numa_topology, network_metadata=None):
image_meta = objects.ImageMeta(properties=objects.ImageMetaProps())
spec_obj = objects.RequestSpec(numa_topology=numa_topology,
pci_requests=None,
instance_uuid=uuids.fake,
flavor=objects.Flavor(extra_specs={}),
image=image_meta)
if network_metadata:
spec_obj.network_metadata = network_metadata
return spec_obj
def test_numa_topology_filter_pass(self):
@@ -230,3 +235,61 @@
'cpu_allocation_ratio': 16.0,
'ram_allocation_ratio': 1.5})
self.assertFalse(self.filt_cls.host_passes(host, spec_obj))
def _get_fake_host_state_with_networks(self):
network_a = objects.NetworkMetadata(physnets=set(['foo', 'bar']),
tunneled=False)
network_b = objects.NetworkMetadata(physnets=set(), tunneled=True)
host_topology = objects.NUMATopology(cells=[
objects.NUMACell(id=1, cpuset=set([1, 2]), memory=2048,
cpu_usage=2, memory_usage=2048, mempages=[],
siblings=[set([1]), set([2])],
pinned_cpus=set([]),
network_metadata=network_a),
objects.NUMACell(id=2, cpuset=set([3, 4]), memory=2048,
cpu_usage=2, memory_usage=2048, mempages=[],
siblings=[set([3]), set([4])],
pinned_cpus=set([]),
network_metadata=network_b)])
return fakes.FakeHostState('host1', 'node1', {
'numa_topology': host_topology,
'pci_stats': None,
'cpu_allocation_ratio': 16.0,
'ram_allocation_ratio': 1.5})
def test_numa_topology_filter_pass_networks(self):
host = self._get_fake_host_state_with_networks()
instance_topology = objects.InstanceNUMATopology(cells=[
objects.InstanceNUMACell(id=0, cpuset=set([1]), memory=512),
objects.InstanceNUMACell(id=1, cpuset=set([3]), memory=512)])
network_metadata = objects.NetworkMetadata(
physnets=set(['foo']), tunneled=False)
spec_obj = self._get_spec_obj(numa_topology=instance_topology,
network_metadata=network_metadata)
self.assertTrue(self.filt_cls.host_passes(host, spec_obj))
# this should pass because while the networks are affined to different
# host NUMA nodes, our guest itself has multiple NUMA nodes
network_metadata = objects.NetworkMetadata(
physnets=set(['foo', 'bar']), tunneled=True)
spec_obj = self._get_spec_obj(numa_topology=instance_topology,
network_metadata=network_metadata)
self.assertTrue(self.filt_cls.host_passes(host, spec_obj))
def test_numa_topology_filter_fail_networks(self):
host = self._get_fake_host_state_with_networks()
instance_topology = objects.InstanceNUMATopology(cells=[
objects.InstanceNUMACell(id=0, cpuset=set([1]), memory=512)])
# this should fail because the networks are affined to different host
# NUMA nodes but our guest only has a single NUMA node
network_metadata = objects.NetworkMetadata(
physnets=set(['foo']), tunneled=True)
spec_obj = self._get_spec_obj(numa_topology=instance_topology,
network_metadata=network_metadata)
self.assertFalse(self.filt_cls.host_passes(host, spec_obj))


@@ -0,0 +1,13 @@
---
features:
- |
It is now possible to configure NUMA affinity for most neutron networks.
This is available for networks that use a ``provider:network_type`` of
``flat`` or ``vlan`` and a ``provider:physical_network`` (L2 networks) or
networks that use a ``provider:network_type`` of ``vxlan``, ``gre`` or
``geneve`` (L3 networks).
For more information, refer to the `spec`__ and `documentation`__.
__ https://specs.openstack.org/openstack/nova-specs/specs/rocky/approved/numa-aware-vswitches.html
__ https://docs.openstack.org/nova/latest/admin/networking.html