scheduler: Start utilizing RequestSpec.network_metadata
Now that we have this information, we can use it to pre-filter suitable hosts. This patch completes the blueprint: documentation and release notes are bundled in the patch, and previously inactive tests are now enabled.

Part of blueprint numa-aware-vswitches

Change-Id: Ide262733ffd7714fdc702b31c61bdd42dbf7acc3
@ -31,6 +31,14 @@ Simultaneous Multi-Threading (SMT)
  CPUs on the system and can execute workloads in parallel. However, as with
  NUMA, threads compete for shared resources.

Non Uniform I/O Access (NUMA I/O)
  In a NUMA system, I/O to a device mapped to a local memory region is more
  efficient than I/O to a remote device. A device connected to the same socket
  providing the CPU and memory offers lower latencies for I/O operations due to
  its physical proximity. This generally manifests itself in devices connected
  to the PCIe bus, such as NICs or vGPUs, but applies to any device supporting
  memory mapped I/O.

In OpenStack, SMP CPUs are known as *cores*, NUMA cells or nodes are known as
*sockets*, and SMT CPUs are known as *threads*. For example, a quad-socket,
eight core system with Hyper-Threading would have four sockets, eight cores per
@ -31,6 +31,7 @@ operating system, and exposes functionality over a web-based API.
   manage-volumes.rst
   migration.rst
   networking-nova.rst
   networking.rst
   node-down.rst
   pci-passthrough.rst
   quotas2.rst
doc/source/admin/networking.rst (new file, 184 lines)
@ -0,0 +1,184 @@
=======================
Networking with neutron
=======================

While nova uses the :neutron-doc:`OpenStack Networking service (neutron) <>` to
provide network connectivity for instances, nova itself provides some
additional features not possible with neutron alone. These are described below.


SR-IOV
------

.. versionchanged:: 2014.2

   The feature described below was first introduced in the Juno release.

The SR-IOV specification defines a standardized mechanism to virtualize PCIe
devices. This mechanism can virtualize a single PCIe Ethernet controller to
appear as multiple PCIe devices. Each device can be directly assigned to an
instance, bypassing the hypervisor and virtual switch layer. As a result, users
are able to achieve low latency and near line-rate speeds.

A full guide on configuring and using SR-IOV is provided in the
:neutron-doc:`OpenStack Networking service documentation
<admin/config-sriov.html>`.


NUMA Affinity
-------------

.. versionadded:: 18.0.0

   The feature described below was first introduced in the Rocky release.

.. important::

   The functionality described below is currently only supported by the
   libvirt/KVM driver.

As described in :doc:`cpu-topologies`, NUMA is a computer architecture where
memory accesses to certain regions of system memory can have higher latencies
than other regions, depending on the CPU(s) your process is running on. This
effect extends to devices connected to the PCIe bus, a concept known as NUMA
I/O. Many Network Interface Cards (NICs) connect using the PCIe interface,
meaning they are susceptible to the ill effects of poor NUMA affinitization. As
a result, NUMA locality must be considered when creating an instance where high
dataplane performance is a requirement.

Fortunately, nova provides functionality to ensure NUMA affinitization is
provided for instances using neutron. How this works depends on the type of
port you are trying to use.

.. todo::

   Add documentation for PCI NUMA affinity and PCI policies and link to it from
   here.

For SR-IOV ports, virtual functions, which are PCI devices, are attached to the
instance. This means the instance can benefit from the NUMA affinity guarantees
provided for PCI devices. This happens automatically.
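As a quick illustration only (the PCI address below is a placeholder, not
something taken from this patch), the NUMA node reported for a PCI device such
as a virtual function can be read from *sysfs* on the host:

.. code-block:: shell

   $ cat /sys/bus/pci/devices/0000:03:00.2/numa_node
   0

A value of ``-1`` indicates that the platform does not report NUMA affinity
for the device.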
For all other types of ports, some manual configuration is required.

#. Identify the type of network(s) you wish to provide NUMA affinity for.

   - If a network is an L2-type network (``provider:network_type`` of ``flat``
     or ``vlan``), affinity of the network to given NUMA node(s) can vary
     depending on the value of the ``provider:physical_network`` attribute of
     the network, commonly referred to as the *physnet* of the network. This is
     because most neutron drivers map each *physnet* to a different bridge, to
     which multiple NICs are attached, or to a different (logical) NIC.

   - If a network is an L3-type network (``provider:network_type`` of
     ``vxlan``, ``gre`` or ``geneve``), all traffic will use the device to
     which the *endpoint IP* is assigned. This means all L3 networks on a given
     host will have affinity to the same NUMA node(s). Refer to
     :neutron-doc:`the neutron documentation
     <admin/intro-overlay-protocols.html>` for more information. A sketch of
     how these attributes can be inspected is shown below.
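   For instance, the relevant provider attributes of a network can be
   inspected with the OpenStack client; the network name ``provider-net``
   below is purely illustrative, not something defined by this guide:

   .. code-block:: shell

      $ openstack network show provider-net \
          -c provider:network_type -c provider:physical_network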
#. Determine the NUMA affinity of the NICs attached to the given network(s).

   How this should be achieved varies depending on the switching solution used
   and whether the network is an L2-type network or an L3-type network.

   Consider an L2-type network using the Linux Bridge mechanism driver. As
   noted in the :neutron-doc:`neutron documentation
   <admin/deploy-lb-selfservice.html>`, *physnets* are mapped to interfaces
   using the ``[linux_bridge] physical_interface_mappings`` configuration
   option. For example:

   .. code-block:: ini

      [linux_bridge]
      physical_interface_mappings = provider:PROVIDER_INTERFACE

   Once you have the device name, you can query *sysfs* to retrieve the NUMA
   affinity for this device. For example:

   .. code-block:: shell

      $ cat /sys/class/net/PROVIDER_INTERFACE/device/numa_node

   For an L3-type network using the Linux Bridge mechanism driver, the device
   used will be configured using a protocol-specific endpoint IP configuration
   option. For VXLAN, this is the ``[vxlan] local_ip`` option. For example:

   .. code-block:: ini

      [vxlan]
      local_ip = OVERLAY_INTERFACE_IP_ADDRESS

   Once you have the IP address in question, you can use :command:`ip` to
   identify the device that has been assigned this IP address and from there
   can query the NUMA affinity using *sysfs* as above.
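   A minimal sketch of chaining those two steps together; the interface name
   ``eth1`` in the sample output is purely illustrative, not something this
   guide defines:

   .. code-block:: shell

      $ ip -o address show to OVERLAY_INTERFACE_IP_ADDRESS
      4: eth1    inet OVERLAY_INTERFACE_IP_ADDRESS/24 scope global eth1 ...
      $ cat /sys/class/net/eth1/device/numa_node
      1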
   .. note::

      The example provided above is merely that: an example. How one should
      identify this information can vary massively depending on the driver
      used, whether bonding is used, the type of network used, etc.

#. Configure NUMA affinity in ``nova.conf``.

   Once you have identified the NUMA affinity of the devices used for your
   networks, you need to configure this in ``nova.conf``. As before, how this
   should be achieved varies depending on the type of network.

   For L2-type networks, NUMA affinity is defined based on the
   ``provider:physical_network`` attribute of the network. There are two
   configuration options that must be set:

   ``[neutron] physnets``
     This should be set to the list of physnets for which you wish to provide
     NUMA affinity. Refer to the :oslo.config:option:`documentation
     <neutron.physnets>` for more information.

   ``[neutron_physnet_{physnet}] numa_nodes``
     This should be set to the list of NUMA node(s) that networks with the
     given ``{physnet}`` should be affinitized to.

   For L3-type networks, NUMA affinity is defined globally for all tunneled
   networks on a given host. There is only one configuration option that must
   be set:

   ``[neutron_tunneled] numa_nodes``
     This should be set to a list of one or more NUMA nodes to which instances
     using tunneled networks will be affinitized.

Examples
~~~~~~~~

Take an example of a deployment using L2-type networks first.

.. code-block:: ini

   [neutron]
   physnets = foo,bar

   [neutron_physnet_foo]
   numa_nodes = 0

   [neutron_physnet_bar]
   numa_nodes = 2, 3

This configuration will ensure instances using one or more L2-type networks
with ``provider:physical_network=foo`` must be scheduled on host cores from
NUMA node 0, while instances using one or more networks with
``provider:physical_network=bar`` must be scheduled on host cores from both
NUMA nodes 2 and 3. For the latter case, it will be necessary to split the
guest across two or more host NUMA nodes using the ``hw:numa_nodes``
:ref:`flavor extra spec <extra-specs-numa-topology>`, as shown in the example
below.
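As a minimal sketch of that last step (the flavor name ``numa.large`` is a
placeholder, not a flavor created by this guide):

.. code-block:: shell

   $ openstack flavor set numa.large --property hw:numa_nodes=2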
Now, take an example of a deployment using L3-type networks.

.. code-block:: ini

   [neutron_tunneled]
   numa_nodes = 0

This is much simpler as all tunneled traffic uses the same logical interface.
As with the L2-type networks, this configuration will ensure instances using
one or more L3-type networks must be scheduled on host cores from NUMA node 0.
It is also possible to define more than one NUMA node, in which case the
instance must be split across these nodes.
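Finally, as an optional sanity check on a libvirt host (this step is not part
of the configuration above, and the domain name ``instance-00000001`` is a
placeholder), the NUMA placement applied to a running guest may be visible in
its domain XML:

.. code-block:: shell

   $ virsh dumpxml instance-00000001 | grep -A 3 '<numatune>'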
@ -818,8 +818,9 @@ class API(base.Base):
            # InstancePCIRequests object
            pci_request_info = pci_request.get_pci_requests_from_flavor(
                instance_type)
            self.network_api.create_resource_requests(
                context, requested_networks, pci_request_info)

            network_metadata = self.network_api.create_resource_requests(
                context, requested_networks, pci_request_info)

            base_options = {
                'reservation_id': reservation_id,
@ -859,13 +860,15 @@ class API(base.Base):

        # return the validated options and maximum number of instances allowed
        # by the network quotas
        return base_options, max_network_count, key_pair, security_groups
        return (base_options, max_network_count, key_pair, security_groups,
                network_metadata)

    def _provision_instances(self, context, instance_type, min_count,
            max_count, base_options, boot_meta, security_groups,
            block_device_mapping, shutdown_terminate,
            instance_group, check_server_group_quota, filter_properties,
            key_pair, tags, trusted_certs, supports_multiattach=False):
            key_pair, tags, trusted_certs, supports_multiattach=False,
            network_metadata=None):
        # Check quotas
        num_instances = compute_utils.check_num_instances_quota(
            context, instance_type, min_count, max_count)
@ -901,6 +904,11 @@ class API(base.Base):
                req_spec.num_instances = num_instances
                req_spec.create()

                # NOTE(stephenfin): The network_metadata field is not persisted
                # and is therefore set after 'create' is called.
                if network_metadata:
                    req_spec.network_metadata = network_metadata

                # Create an instance object, but do not store in db yet.
                instance = objects.Instance(context=context)
                instance.uuid = instance_uuid
@ -1148,8 +1156,8 @@ class API(base.Base):
        self._check_auto_disk_config(image=boot_meta,
                                     auto_disk_config=auto_disk_config)

        base_options, max_net_count, key_pair, security_groups = \
            self._validate_and_build_base_options(
        base_options, max_net_count, key_pair, security_groups, \
            network_metadata = self._validate_and_build_base_options(
                context, instance_type, boot_meta, image_href, image_id,
                kernel_id, ramdisk_id, display_name, display_description,
                key_name, key_data, security_groups, availability_zone,

@ -1189,7 +1197,7 @@ class API(base.Base):
            boot_meta, security_groups, block_device_mapping,
            shutdown_terminate, instance_group, check_server_group_quota,
            filter_properties, key_pair, tags, trusted_certs,
            supports_multiattach)
            supports_multiattach, network_metadata)

        instances = []
        request_specs = []
@ -76,6 +76,10 @@ class NUMATopologyFilter(filters.BaseHostFilter):
                                                             host_state)
        pci_requests = spec_obj.pci_requests

        network_metadata = None
        if 'network_metadata' in spec_obj:
            network_metadata = spec_obj.network_metadata

        if pci_requests:
            pci_requests = pci_requests.requests

@ -87,6 +91,10 @@ class NUMATopologyFilter(filters.BaseHostFilter):
        limits = objects.NUMATopologyLimits(
            cpu_allocation_ratio=cpu_ratio,
            ram_allocation_ratio=ram_ratio)

        if network_metadata:
            limits.network_metadata = network_metadata

        instance_topology = (hardware.numa_fit_instance_to_host(
            host_topology, requested_topology,
            limits=limits,
@ -326,9 +326,7 @@ class NUMAServersWithNetworksTest(NUMAServersTestBase):
                                                  flavor_id, networks)

        self.assertTrue(filter_mock.called)
        # TODO(stephenfin): Switch this to 'ERROR' once the final patch is
        # merged
        self.assertEqual('ACTIVE', status)
        self.assertEqual('ERROR', status)

    def test_create_server_with_physnet_and_tunneled_net(self):
        """Test combination of physnet and tunneled network.
@ -293,7 +293,7 @@ class _ComputeAPIUnitTestMixIn(object):
            mock.patch.object(self.compute_api,
                              '_validate_and_build_base_options',
                              return_value=({}, max_net_count, None,
                                            ['default']))
                                            ['default'], None))
        ) as (
            get_image,
            check_auto_disk_config,

@ -6076,7 +6076,8 @@ class ComputeAPIUnitTestCase(_ComputeAPIUnitTestMixIn, test.NoDBTestCase):
        with mock.patch.object(
                self.compute_api.security_group_api, 'get',
                return_value={'id': uuids.secgroup_uuid}) as scget:
            base_options, max_network_count, key_pair, security_groups = (
            base_options, max_network_count, key_pair, security_groups, \
                network_metadata = (
                    self.compute_api._validate_and_build_base_options(
                        self.context, instance_type, boot_meta, uuids.image_href,
                        mock.sentinel.image_id, kernel_id, ramdisk_id,
@ -628,7 +628,7 @@ class CellsConductorAPIRPCRedirect(test.NoDBTestCase):
                               _validate, _get_image, _check_bdm,
                               _provision, _record_action_start):
        _get_image.return_value = (None, 'fake-image')
        _validate.return_value = ({}, 1, None, ['default'])
        _validate.return_value = ({}, 1, None, ['default'], None)
        _check_bdm.return_value = objects.BlockDeviceMappingList()
        _provision.return_value = []

@ -28,13 +28,18 @@ class TestNUMATopologyFilter(test.NoDBTestCase):
        super(TestNUMATopologyFilter, self).setUp()
        self.filt_cls = numa_topology_filter.NUMATopologyFilter()

    def _get_spec_obj(self, numa_topology):
    def _get_spec_obj(self, numa_topology, network_metadata=None):
        image_meta = objects.ImageMeta(properties=objects.ImageMetaProps())

        spec_obj = objects.RequestSpec(numa_topology=numa_topology,
                                       pci_requests=None,
                                       instance_uuid=uuids.fake,
                                       flavor=objects.Flavor(extra_specs={}),
                                       image=image_meta)

        if network_metadata:
            spec_obj.network_metadata = network_metadata

        return spec_obj

    def test_numa_topology_filter_pass(self):
@ -230,3 +235,61 @@ class TestNUMATopologyFilter(test.NoDBTestCase):
            'cpu_allocation_ratio': 16.0,
            'ram_allocation_ratio': 1.5})
        self.assertFalse(self.filt_cls.host_passes(host, spec_obj))

    def _get_fake_host_state_with_networks(self):
        network_a = objects.NetworkMetadata(physnets=set(['foo', 'bar']),
                                            tunneled=False)
        network_b = objects.NetworkMetadata(physnets=set(), tunneled=True)
        host_topology = objects.NUMATopology(cells=[
            objects.NUMACell(id=1, cpuset=set([1, 2]), memory=2048,
                             cpu_usage=2, memory_usage=2048, mempages=[],
                             siblings=[set([1]), set([2])],
                             pinned_cpus=set([]),
                             network_metadata=network_a),
            objects.NUMACell(id=2, cpuset=set([3, 4]), memory=2048,
                             cpu_usage=2, memory_usage=2048, mempages=[],
                             siblings=[set([3]), set([4])],
                             pinned_cpus=set([]),
                             network_metadata=network_b)])

        return fakes.FakeHostState('host1', 'node1', {
            'numa_topology': host_topology,
            'pci_stats': None,
            'cpu_allocation_ratio': 16.0,
            'ram_allocation_ratio': 1.5})

    def test_numa_topology_filter_pass_networks(self):
        host = self._get_fake_host_state_with_networks()

        instance_topology = objects.InstanceNUMATopology(cells=[
            objects.InstanceNUMACell(id=0, cpuset=set([1]), memory=512),
            objects.InstanceNUMACell(id=1, cpuset=set([3]), memory=512)])

        network_metadata = objects.NetworkMetadata(
            physnets=set(['foo']), tunneled=False)
        spec_obj = self._get_spec_obj(numa_topology=instance_topology,
                                      network_metadata=network_metadata)
        self.assertTrue(self.filt_cls.host_passes(host, spec_obj))

        # this should pass because while the networks are affined to different
        # host NUMA nodes, our guest itself has multiple NUMA nodes
        network_metadata = objects.NetworkMetadata(
            physnets=set(['foo', 'bar']), tunneled=True)
        spec_obj = self._get_spec_obj(numa_topology=instance_topology,
                                      network_metadata=network_metadata)
        self.assertTrue(self.filt_cls.host_passes(host, spec_obj))

    def test_numa_topology_filter_fail_networks(self):
        host = self._get_fake_host_state_with_networks()

        instance_topology = objects.InstanceNUMATopology(cells=[
            objects.InstanceNUMACell(id=0, cpuset=set([1]), memory=512)])

        # this should fail because the networks are affined to different host
        # NUMA nodes but our guest only has a single NUMA node
        network_metadata = objects.NetworkMetadata(
            physnets=set(['foo']), tunneled=True)
        spec_obj = self._get_spec_obj(numa_topology=instance_topology,
                                      network_metadata=network_metadata)

        self.assertFalse(self.filt_cls.host_passes(host, spec_obj))
@ -0,0 +1,13 @@
---
features:
  - |
    It is now possible to configure NUMA affinity for most neutron networks.
    This is available for networks that use a ``provider:network_type`` of
    ``flat`` or ``vlan`` and a ``provider:physical_network`` (L2 networks) or
    networks that use a ``provider:network_type`` of ``vxlan``, ``gre`` or
    ``geneve`` (L3 networks).

    For more information, refer to the `spec`__ and `documentation`__.

    __ https://specs.openstack.org/openstack/nova-specs/specs/rocky/approved/numa-aware-vswitches.html
    __ https://docs.openstack.org/nova/latest/admin/networking.html