Use VFs for DPDK apps in pods inside VM

This commit allows using virtual functions in pods
when those virtual functions are passed through into a VM.
The documentation in this commit describes the steps
necessary to run DPDK applications inside containers.

Limitations:
    1. MAC and VLAN tag cannot be set from inside the VM
    2. VFs may end up in a down state after the rebinding process

Change-Id: I9d1c74b36afcd32d0c583a70a3d819eb775ae372
Signed-off-by: Danil Golov <d.golov@samsung.com>
Danil Golov 2020-02-21 13:41:59 +03:00 committed by Michał Dulko
parent 4ebece5fa6
commit 9662b35826
2 changed files with 209 additions and 62 deletions


@@ -6,31 +6,32 @@ How to configure SR-IOV ports
Current approach of SR-IOV relies on `sriov-device-plugin`_. While creating
pods with SR-IOV, sriov-device-plugin should be enabled on all nodes. To use
a SR-IOV port on a baremetal or VM installation, the following steps should be done:
#. Create OpenStack networks and subnets for SR-IOV. The following steps should be
done with admin rights.
.. code-block:: console
$ openstack network create --share --provider-physical-network physnet22 --provider-network-type vlan --provider-segment 3501 vlan-sriov-net-1
$ openstack network create --share --provider-physical-network physnet23 --provider-network-type vlan --provider-segment 3502 vlan-sriov-net-2
$ openstack subnet create --network vlan-sriov-net-1 --subnet-range 192.168.2.0/24 vlan-sriov-subnet-1
$ openstack subnet create --network vlan-sriov-net-2 --subnet-range 192.168.3.0/24 vlan-sriov-subnet-2
Subnet ids of ``vlan-sriov-subnet-1`` and ``vlan-sriov-subnet-2`` will be
used later in NetworkAttachmentDefinition.
#. Add sriov section into kuryr.conf.
.. code-block:: ini
[sriov]
default_physnet_subnets = physnet22:<UUID of vlan-sriov-subnet-1>,physnet23:<UUID of vlan-sriov-subnet-2>
device_plugin_resource_prefix = intel.com
physnet_resource_mappings = physnet22:physnet22,physnet23:physnet23
resource_driver_mappings = physnet22:vfio-pci,physnet23:vfio-pci
This mapping is required to be able to find the appropriate PF/VF functions at
the binding phase. physnet22 and physnet23 are just identifiers for the subnets
``vlan-sriov-subnet-1`` and ``vlan-sriov-subnet-2``. Such kind of transition is
necessary to support a many-to-many relation.
#. Prepare NetworkAttachmentDefinition objects. Apply
NetworkAttachmentDefinition with "sriov" driverType inside, as described in
`NPWG spec`_.
@@ -42,53 +43,109 @@ a SR-IOV port on a baremetal installation the 3 following steps should be done:
name: "sriov-net1"
annotations:
openstack.org/kuryr-config: '{
"subnetId": "UUID of vlan-sriov-net",
"subnetId": "UUID of vlan-sriov-subnet-1",
"driverType": "sriov"
}'
.. code-block:: yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
name: "sriov-net2"
annotations:
openstack.org/kuryr-config: '{
"subnetId": "UUID of vlan-sriov-subnet-2",
"driverType": "sriov"
}'
Use the following yaml to create a pod with two additional SR-IOV interfaces:
.. code-block:: yaml

   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: nginx-sriov
     labels:
       app: nginx-sriov
   spec:
     replicas: 1
     selector:
       matchLabels:
         app: nginx-sriov
     template:
       metadata:
         labels:
           app: nginx-sriov
         annotations:
           k8s.v1.cni.cncf.io/networks: sriov-net1,sriov-net2
       spec:
         containers:
         - securityContext:
             privileged: true
             capabilities:
               add:
               - SYS_ADMIN
               - IPC_LOCK
               - SYS_NICE
               - SYS_RAWIO
           name: nginx-sriov
           image: nginx:1.13.8
           resources:
             requests:
               intel.com/physnet22: '1'
               intel.com/physnet23: '1'
               cpu: "2"
               memory: "512Mi"
               hugepages-2Mi: 512Mi
             limits:
               intel.com/physnet22: '1'
               intel.com/physnet23: '1'
               cpu: "2"
               memory: "512Mi"
               hugepages-2Mi: 512Mi
           volumeMounts:
           - name: dev
             mountPath: /dev
           - name: hugepage
             mountPath: /hugepages
           - name: sys
             mountPath: /sys
         volumes:
         - name: dev
           hostPath:
             path: /dev
             type: Directory
         - name: hugepage
           emptyDir:
             medium: HugePages
         - name: sys
           hostPath:
             path: /sys
In the above example two SR-IOV devices will be attached to the pod. The first
one is described in the sriov-net1 NetworkAttachmentDefinition, the second one
in sriov-net2. They may have different subnetIds. It is necessary to mount the
host's ``/dev`` and ``/hugepages`` directories into the pod to allow the pod
to use vfio devices. ``privileged: true`` is necessary only in case the node
is a virtual machine. For a baremetal node this option is not necessary. The
``IPC_LOCK`` capability and the other listed capabilities are necessary when
the node is a virtual machine.
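For the ``hugepages-2Mi`` requests above to be satisfiable, hugepages have to
be pre-allocated on the worker node. A minimal sketch (the amount is only an
example and depends on the workload):

.. code-block:: bash

   # allocate 1024 2MiB hugepages (2 GiB in total) on the node
   echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages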
#. Specify resource names
The resource names *intel.com/physnet22* and *intel.com/physnet23*, which are
used in the above example, are defined in the SR-IOV network device plugin
configuration (see `SRIOV network device plugin for Kubernetes`_). A resource
name should match the "^\[a-zA-Z0-9\_\]+$"
regular expression. To be able to work with arbitrary resource names
physnet_resource_mappings and device_plugin_resource_prefix in [sriov]
section of kuryr-controller configuration file should be filled in. The
default value of device_plugin_resource_prefix is ``intel.com``, the same
as in the SR-IOV network device plugin. If the SR-IOV network device plugin
was started with a -resource-prefix value different from ``intel.com``, that
value has to be set in device_plugin_resource_prefix, otherwise
kuryr-kubernetes will not be able to work with the resource.
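For example, if the SR-IOV network device plugin was started with
``-resource-prefix samsung.com``, the ``[sriov]`` section could look like this
(a sketch; the physnet names have to match your own configuration):

.. code-block:: ini

   [sriov]
   device_plugin_resource_prefix = samsung.com
   physnet_resource_mappings = physnet22:physnet22,physnet23:physnet23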
Assume we have the following SR-IOV network device plugin configuration
(defined by the -config-file option):
@@ -99,18 +156,27 @@ a SR-IOV port on a baremetal installation the 3 following steps should be done:
"resourceList":
[
{
"resourceName": "numa0",
"resourceName": "physnet22",
"rootDevices": ["0000:02:00.0"],
"sriovMode": true,
"deviceType": "netdevice"
"deviceType": "vfio"
},
{
"resourceName": "physnet23",
"rootDevices": ["0000:02:00.1"],
"sriovMode": true,
"deviceType": "vfio"
}
]
}
The config file above describes two physical devices mapped to two
resources. Virtual functions from these devices will be used for pods.
We defined ``physnet22`` and ``physnet23`` as resource names, and we also
assume we started sriovdp with the -resource-prefix intel.com value. The PCI
address of the ens6 interface is "0000:02:00.0" and the PCI address of the
ens8 interface is "0000:02:00.1". If we assign 8 VFs to ens6 and 8 VFs to
ens8 and launch the SR-IOV network device plugin, we can see the following
state of kubernetes:
.. code-block:: console
@@ -120,18 +186,58 @@ a SR-IOV port on a baremetal installation the 3 following steps should be done:
"ephemeral-storage": "269986638772",
"hugepages-1Gi": "8Gi",
"hugepages-2Mi": "0Gi",
"samsung.com/numa0": "8",
"intel.com/physnet22": "8",
"intel.com/physnet23": "8",
"memory": "7880620Ki",
"pods": "1k"
}
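Note that the 8 VFs per interface mentioned above have to be created
beforehand. On a typical Linux node this can be done through sysfs, for
example (a sketch, assuming ``ens6`` and ``ens8`` are the PF names):

.. code-block:: console

   $ echo 8 > /sys/class/net/ens6/device/sriov_numvfs
   $ echo 8 > /sys/class/net/ens8/device/sriov_numvfs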
If you use a virtual machine as your worker node, then it is necessary to
use sriov-device-plugin version 3.1, because it provides selectors which
make it possible to separate the particular VFs that are passed into the VM.
The config file for sriov-device-plugin may look like:
.. code-block:: json
{
"resourceList": [{
"resourceName": "physnet22",
"selectors": {
"vendors": ["8086"],
"devices": ["1520"],
"pfNames": ["ens6"]
}
},
{
"resourceName": "physnet23",
"selectors": {
"vendors": ["8086"],
"devices": ["1520"],
"pfNames": ["ens8"]
}
}
]
}
We defined the ``physnet22`` resource name, which maps to the ``ens6``
interface, the first virtual function passed into the VM. The same applies to
``physnet23``: it maps to the ``ens8`` interface. It is important to note that
in case of virtual machine usage the names of the passed-through virtual
functions should be specified as physical devices. Thus we expect sriov-dp to
annotate a different PCI address for each resource:
.. code-block:: console
$ kubectl get node node1 -o json | jq '.status.allocatable'
{
"cpu": "4",
"ephemeral-storage": "269986638772",
"hugepages-2Mi": "2Gi",
"intel.com/physnet22": "1",
"intel.com/physnet23": "1",
"memory": "7880620Ki",
}
#. Enable Kubelet Pod Resources feature
@@ -169,6 +275,19 @@ a SR-IOV port on a baremetal installation the 3 following steps should be done:
update ports with binding:profile information. Due to this it is necessary
to perform these actions with a privileged user with admin rights.
#. Use vfio devices in containers
To use vfio devices inside containers it is necessary to load the vfio-pci
module. Remember that if the worker node is a virtual machine, the module
should be loaded without IOMMU support:
.. code-block:: bash
rmmod vfio_pci
rmmod vfio_iommu_type1
rmmod vfio
modprobe vfio enable_unsafe_noiommu_mode=1
modprobe vfio-pci
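To check that the module was loaded in unsafe no-IOMMU mode and that a vfio
device node shows up once a VF has been rebound (a sketch; the group number
depends on the device):

.. code-block:: console

   $ cat /sys/module/vfio/parameters/enable_unsafe_noiommu_mode
   Y
   $ ls /dev/vfio/
   noiommu-0  vfio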
.. _NPWG spec: https://docs.openstack.org/kuryr-kubernetes/latest/specs/rocky/npwg_spec_support.html
.. _sriov-device-plugin: https://docs.google.com/document/d/1D3dJeUUmta3sMzqw8JtWFoG2rvcJiWitVro9bsfUTEw


@@ -136,10 +136,11 @@ class VIFSriovDriver(health.HealthHandler, b_base.BaseBindingDriver):
if driver in constants.USERSPACE_DRIVERS:
LOG.info("PCI device %s will be rebinded to userspace network "
"driver %s", pci, driver)
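# When the VF itself is passed through into a VM, no PF is visible from
# inside the guest, so MAC address and VLAN tag cannot be set there.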
if vf_index and pf:
self._set_vf_mac(pf, vf_index, vif.address)
if vif.network.should_provide_vlan:
vlan_id = vif.network.vlan
self._set_vf_vlan(pf, vf_index, vlan_id)
old_driver = self._bind_device(pci, driver)
self._annotate_device(pod_link, pci, old_driver, driver, port_id)
else:
@@ -152,9 +153,10 @@ class VIFSriovDriver(health.HealthHandler, b_base.BaseBindingDriver):
def _move_to_netns(self, pci, ifname, netns, vif, vf_name, vf_index, pf,
pci_info):
if vf_index and pf:
if vif.network.should_provide_vlan:
vlan_id = vif.network.vlan
self._set_vf_vlan(pf, vf_index, vlan_id)
self._set_vf_mac(pf, vf_index, vif.address)
@@ -180,6 +182,13 @@ class VIFSriovDriver(health.HealthHandler, b_base.BaseBindingDriver):
vf_name = vf_names[0]
pfysfn_path = '/sys/bus/pci/devices/{}/physfn/net/'.format(pci)
# If physical function is not specified in VF's directory then
# this VF belongs to current VM node
if not os.path.exists(pfysfn_path):
LOG.info("Current device %s is a virtual function which is "
"passed into VM. Getting it's pci info", vf_name)
pci_info = self._get_vf_pci_info(pci, vf_name)
return vf_name, None, None, pci_info
pf_names = os.listdir(pfysfn_path)
pf_name = pf_names[0]
@@ -194,6 +203,25 @@ class VIFSriovDriver(health.HealthHandler, b_base.BaseBindingDriver):
return vf_name, vf_index, pf_name, pci_info
return None, None, None, None
def _get_vf_pci_info(self, pci, vf_name):
vendor_path = '/sys/bus/pci/devices/{}/vendor'.format(pci)
with open(vendor_path) as vendor_file:
# vendor_full contains a hex value (e.g. 0x8086)
vendor_full = vendor_file.read()
vendor = vendor_full.split('x')[1].strip()
device_path = '/sys/bus/pci/devices/{}/device'.format(pci)
LOG.info("Full path to device which is being processed",
device_path)
with open(device_path) as device_file:
# device_full contains a hex value (e.g. 0x1520)
device_full = device_file.read()
device = device_full.split('x')[1].strip()
pci_vendor_info = '{}:{}'.format(vendor, device)
return {'pci_slot': pci,
'pci_vendor_info': pci_vendor_info}
def _bind_device(self, pci, driver, old_driver=None):
if not old_driver:
old_driver_path = '/sys/bus/pci/devices/{}/driver'.format(pci)