Use VFs for DPDK apps in pods inside a VM

This commit allows using virtual functions in pods when those virtual
functions are passed into a VM. The documentation in this commit describes
the steps necessary to run DPDK applications inside containers.

Limitations:

1. The MAC address and VLAN tag cannot be set from inside the VM.
2. VFs may end up in a down state after the rebinding process.

Change-Id: I9d1c74b36afcd32d0c583a70a3d819eb775ae372
Signed-off-by: Danil Golov <d.golov@samsung.com>
parent 4ebece5fa6, commit 9662b35826
@@ -6,31 +6,32 @@ How to configure SR-IOV ports
 Current approach of SR-IOV relies on `sriov-device-plugin`_. While creating
 pods with SR-IOV, sriov-device-plugin should be turned on on all nodes. To use
-a SR-IOV port on a baremetal installation the 3 following steps should be done:
+a SR-IOV port on a baremetal or VM installation following steps should be done:
 
-#. Create OpenStack network and subnet for SR-IOV. Following steps should be
+#. Create OpenStack networks and subnets for SR-IOV. Following steps should be
    done with admin rights.
 
    .. code-block:: console
 
-      $ neutron net-create vlan-sriov-net --shared --provider:physical_network physnet10_4 --provider:network_type vlan --provider:segmentation_id 3501
-      $ neutron subnet-create vlan-sriov-net 203.0.114.0/24 --name vlan-sriov-subnet --gateway 203.0.114.1
+      $ openstack network create --share --provider-physical-network physnet22 --provider-network-type vlan --provider-segment 3501 vlan-sriov-net-1
+      $ openstack network create --share --provider-physical-network physnet23 --provider-network-type vlan --provider-segment 3502 vlan-sriov-net-2
+      $ openstack subnet create --network vlan-sriov-net-1 --subnet-range 192.168.2.0/24 vlan-sriov-subnet-1
+      $ openstack subnet create --network vlan-sriov-net-2 --subnet-range 192.168.3.0/24 vlan-sriov-subnet-2
 
-   Subnet id <UUID of vlan-sriov-net> will be used later in NetworkAttachmentDefinition.
+   Subnet ids of ``vlan-sriov-subnet-1`` and ``vlan-sriov-subnet-2`` will be
+   used later in NetworkAttachmentDefinition.
 
 #. Add sriov section into kuryr.conf.
 
    .. code-block:: ini
 
      [sriov]
-     default_physnet_subnets = physnet1:<UUID of vlan-sriov-net>
+     default_physnet_subnets = physnet22:<UUID of vlan-sriov-subnet-1>,physnet23:<UUID of vlan-sriov-subnet-2>
+     device_plugin_resource_prefix = intel.com
+     physnet_resource_mappings = physnet22:physnet22,physnet23:physnet23
+     resource_driver_mappings = physnet22:vfio-pci,physnet23:vfio-pci
 
    This mapping is required for ability to find appropriate PF/VF functions at
    binding phase. physnet1 is just an identifier for subnet <UUID of
    vlan-sriov-net>. Such kind of transition is necessary to support
    many-to-many relation.
 
-#. Prepare NetworkAttachmentDefinition object. Apply
+#. Prepare NetworkAttachmentDefinition objects. Apply
    NetworkAttachmentDefinition with "sriov" driverType inside, as described in
    `NPWG spec`_.
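The `default_physnet_subnets` value added in the hunk above is a comma-separated list of `physnet:subnet-UUID` pairs. As a hedged illustration only (this helper is hypothetical, not kuryr's actual option parser), such a value can be split into a mapping like this:

```python
def parse_physnet_subnets(value):
    """Split a 'physnet:uuid,physnet:uuid' string into a dict.

    Illustrative sketch of the physnet -> subnet-UUID mapping shape that
    the [sriov] section above describes; not kuryr's real parser.
    """
    mapping = {}
    for pair in value.split(','):
        physnet, _, subnet_id = pair.partition(':')
        mapping[physnet.strip()] = subnet_id.strip()
    return mapping


# UUIDs below are placeholders standing in for the real subnet ids.
conf = ("physnet22:11111111-2222-3333-4444-555555555555,"
        "physnet23:66666666-7777-8888-9999-000000000000")
print(parse_physnet_subnets(conf)["physnet22"])
```

This is why the option supports a many-to-many relation: each physnet key can point at its own subnet.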
@@ -42,53 +43,109 @@ a SR-IOV port on a baremetal installation the 3 following steps should be done:
        name: "sriov-net1"
        annotations:
          openstack.org/kuryr-config: '{
-         "subnetId": "UUID of vlan-sriov-net",
+         "subnetId": "UUID of vlan-sriov-subnet-1",
          "driverType": "sriov"
          }'
 
-   Then add k8s.v1.cni.cncf.io/networks and request/limits for SR-IOV into the
-   pod's yaml.
+   .. code-block:: yaml
+
+      apiVersion: "k8s.cni.cncf.io/v1"
+      kind: NetworkAttachmentDefinition
+      metadata:
+        name: "sriov-net2"
+        annotations:
+          openstack.org/kuryr-config: '{
+          "subnetId": "UUID of vlan-sriov-subnet-2",
+          "driverType": "sriov"
+          }'
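The `openstack.org/kuryr-config` annotation used by both definitions is a small JSON document with two keys. A minimal sketch of composing it programmatically (the helper name is illustrative, not part of kuryr):

```python
import json


def kuryr_config_annotation(subnet_id, driver_type="sriov"):
    """Build the openstack.org/kuryr-config annotation value.

    Hypothetical convenience helper: it only mirrors the JSON shape shown
    in the NetworkAttachmentDefinition examples above.
    """
    return json.dumps({"subnetId": subnet_id, "driverType": driver_type})


print(kuryr_config_annotation("UUID of vlan-sriov-subnet-1"))
```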
 
+   Use the following yaml to create pod with two additional SR-IOV interfaces:
 
    .. code-block:: yaml
 
-      kind: Pod
+      apiVersion: apps/v1
+      kind: Deployment
       metadata:
-        name: my-pod
-        namespace: my-namespace
-        annotations:
-          k8s.v1.cni.cncf.io/networks: sriov-net1,sriov-net2
+        name: nginx-sriov
+        labels:
+          app: nginx-sriov
       spec:
-        containers:
-        - name: containerName
-          image: containerImage
-          imagePullPolicy: IfNotPresent
-          command: ["tail", "-f", "/dev/null"]
-          resources:
-            requests:
-              intel.com/sriov: '2'
-            limits:
-              intel.com/sriov: '2'
+        replicas: 1
+        selector:
+          matchLabels:
+            app: nginx-sriov
+        template:
+          metadata:
+            labels:
+              app: nginx-sriov
+            annotations:
+              k8s.v1.cni.cncf.io/networks: sriov-net1,sriov-net2
+          spec:
+            containers:
+            - securityContext:
+                privileged: true
+                capabilities:
+                  add:
+                  - SYS_ADMIN
+                  - IPC_LOCK
+                  - SYS_NICE
+                  - SYS_RAWIO
+              name: nginx-sriov
+              image: nginx:1.13.8
+              resources:
+                requests:
+                  intel.com/physnet22: '1'
+                  intel.com/physnet23: '1'
+                  cpu: "2"
+                  memory: "512Mi"
+                  hugepages-2Mi: 512Mi
+                limits:
+                  intel.com/physnet22: '1'
+                  intel.com/physnet23: '1'
+                  cpu: "2"
+                  memory: "512Mi"
+                  hugepages-2Mi: 512Mi
+              volumeMounts:
+              - name: dev
+                mountPath: /dev
+              - name: hugepage
+                mountPath: /hugepages
+              - name: sys
+                mountPath: /sys
+            volumes:
+            - name: dev
+              hostPath:
+                path: /dev
+                type: Directory
+            - name: hugepage
+              emptyDir:
+                medium: HugePages
+            - name: sys
+              hostPath:
+                path: /sys
 
    In the above example two SR-IOV devices will be attached to pod. First one
-   is described in sriov-net1 NetworkAttachmentDefinition, second one in
-   sriov-net2. They may have different subnetId.
+   is described in the ``sriov-net1`` NetworkAttachmentDefinition, second one
+   in ``sriov-net2``. They may have different subnetId. It is necessary to
+   mount the host's ``/dev`` and ``/hugepages`` directories into the pod so
+   that the pod is able to use vfio devices. ``privileged: true`` is necessary
+   only when the node is a virtual machine; for a baremetal node this option
+   is not necessary. The ``IPC_LOCK`` capability and the other listed
+   capabilities are necessary when the node is a virtual machine.
 
 #. Specify resource names
 
-   The resource name *intel.com/sriov*, which used in the above example is the
-   default resource name. This name was used in SR-IOV network device plugin in
-   version 1 (release-v1 branch). But since latest version the device plugin
-   can use any arbitrary name of the resources (see `SRIOV network device
-   plugin for Kubernetes`_). This name should match "^\[a-zA-Z0-9\_\]+$"
+   The resource names *intel.com/physnet22* and *intel.com/physnet23*, which
+   are used in the above example, are arbitrary resource names (see `SRIOV
+   network device plugin for Kubernetes`_). Such a name should match the
+   "^\[a-zA-Z0-9\_\]+$"
    regular expression. To be able to work with arbitrary resource names
    physnet_resource_mappings and device_plugin_resource_prefix in [sriov]
-   section of kuryr-controller configuration file should be filled. The
-   default value for device_plugin_resource_prefix is intel.com, the same as in
-   SR-IOV network device plugin, in case of SR-IOV network device plugin was
-   started with value of -resource-prefix option different from intel.com, than
-   value should be set to device_plugin_resource_prefix, otherwise
-   kuryr-kubernetes will not work with resource.
+   section of kuryr-controller configuration file should be filled. The
+   default value for device_plugin_resource_prefix is ``intel.com``, the same
+   as in the SR-IOV network device plugin. In case the SR-IOV network device
+   plugin was started with a -resource-prefix value different from
+   ``intel.com``, that value should be set in device_plugin_resource_prefix,
+   otherwise kuryr-kubernetes will not work with the resource.
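The naming rule above is concrete enough to sketch: a full resource name is `<prefix>/<name>`, where the name part must match `^[a-zA-Z0-9_]+$`. A small illustrative helper (not kuryr code):

```python
import re

# The pattern the documentation above requires for the name part.
RESOURCE_NAME_RE = re.compile(r'^[a-zA-Z0-9_]+$')


def full_resource_name(prefix, name):
    """Join a device-plugin prefix and a validated resource name.

    Hypothetical helper mirroring how intel.com/physnet22 is formed from
    device_plugin_resource_prefix and the plugin's resourceName.
    """
    if not RESOURCE_NAME_RE.match(name):
        raise ValueError('invalid resource name: %s' % name)
    return '{}/{}'.format(prefix, name)


print(full_resource_name('intel.com', 'physnet22'))  # intel.com/physnet22
```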
 
    Assume we have following SR-IOV network device plugin (defined by
    -config-file option)
 
@@ -99,18 +156,27 @@ a SR-IOV port on a baremetal installation the 3 following steps should be done:
       "resourceList":
          [
             {
-               "resourceName": "numa0",
+               "resourceName": "physnet22",
                "rootDevices": ["0000:02:00.0"],
                "sriovMode": true,
-               "deviceType": "netdevice"
+               "deviceType": "vfio"
+            },
+            {
+               "resourceName": "physnet23",
+               "rootDevices": ["0000:02:00.1"],
+               "sriovMode": true,
+               "deviceType": "vfio"
             }
          ]
       }
 
-   We defined numa0 resource name, also assume we started sriovdp with
-   -resource-prefix samsung.com value. The PCI address of ens4f0 interface is
-   "0000:02:00.0". If we assigned 8 VF to ens4f0 and launch SR-IOV network
-   device plugin, we can see following state of kubernetes
+   The config file above describes two physical devices mapped on two
+   resources. Virtual functions from these devices will be used for pods.
+   We defined ``physnet22`` and ``physnet23`` as resource names, also assume
+   we started sriovdp with -resource-prefix intel.com value. The PCI address
+   of ens6 interface is "0000:02:00.0" and the PCI address of ens8 interface
+   is "0000:02:00.1". If we assigned 8 VF to ens6 and 8 VF to ens8 and launch
+   SR-IOV network device plugin, we can see following state of kubernetes:
 
    .. code-block:: console
 
@@ -120,18 +186,58 @@ a SR-IOV port on a baremetal installation the 3 following steps should be done:
       "ephemeral-storage": "269986638772",
       "hugepages-1Gi": "8Gi",
       "hugepages-2Mi": "0Gi",
-      "samsung.com/numa0": "8",
+      "intel.com/physnet22": "8",
+      "intel.com/physnet23": "8",
       "memory": "7880620Ki",
       "pods": "1k"
       }
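In that allocatable output, the SR-IOV resources are exactly the keys carrying the device-plugin prefix. A hedged sketch of filtering them out of `node.status.allocatable` (illustrative helper, not kuryr code):

```python
def sriov_allocatable(allocatable, prefix="intel.com"):
    """Pick the SR-IOV VF counts out of node.status.allocatable.

    Illustrative: keeps only keys like 'intel.com/physnet22' and converts
    their string counts to integers.
    """
    return {key.split('/', 1)[1]: int(value)
            for key, value in allocatable.items()
            if key.startswith(prefix + '/')}


node_allocatable = {
    "cpu": "4",
    "intel.com/physnet22": "8",
    "intel.com/physnet23": "8",
}
print(sriov_allocatable(node_allocatable))
```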
 
-   We have to add to the sriov section following mapping:
+   If you use a virtual machine as your worker node, then it is necessary to
+   use version 3.1 of sriov-device-plugin because it provides selectors which
+   are important to separate the particular VFs which are passed into the VM.
 
-   .. code-block:: ini
+   Config file for sriov-device-plugin may look like:
 
-      [sriov]
-      device_plugin_resource_prefix = samsung.com
-      physnet_resource_mappings = physnet1:numa0
+   .. code-block:: json
+
+      {
+          "resourceList": [{
+                  "resourceName": "physnet22",
+                  "selectors": {
+                      "vendors": ["8086"],
+                      "devices": ["1520"],
+                      "pfNames": ["ens6"]
+                  }
+              },
+              {
+                  "resourceName": "physnet23",
+                  "selectors": {
+                      "vendors": ["8086"],
+                      "devices": ["1520"],
+                      "pfNames": ["ens8"]
+                  }
+              }
+          ]
+      }
+
+   We defined the ``physnet22`` resource name that maps to the ``ens6``
+   interface, which is the first virtual function passed into the VM. The same
+   goes for ``physnet23``, which maps to the ``ens8`` interface. It is
+   important to note that in case of virtual machine usage we should specify
+   the names of the passed virtual functions as physical devices. Thus we
+   expect sriov-dp to annotate different pci addresses for each resource:
+
+   .. code-block:: console
+
+      $ kubectl get node node1 -o json | jq '.status.allocatable'
+      {
+          "cpu": "4",
+          "ephemeral-storage": "269986638772",
+          "hugepages-2Mi": "2Gi",
+          "intel.com/physnet22": "1",
+          "intel.com/physnet23": "1",
+          "memory": "7880620Ki",
+      }
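The `vendors`/`devices` selectors above match the hex ids a device exposes in sysfs (`/sys/bus/pci/devices/<addr>/vendor` and `.../device`, e.g. `0x8086`, `0x1520`). A hedged sketch of that matching logic (illustrative only; the real filtering happens inside sriov-device-plugin):

```python
def matches_selector(vendor_raw, device_raw, selector):
    """Check raw sysfs vendor/device values against a selector dict.

    Illustrative: sysfs reports values like '0x8086\n'; the selector lists
    bare hex strings like '8086', as in the JSON config above.
    """
    vendor = vendor_raw.split('x')[-1].strip()
    device = device_raw.split('x')[-1].strip()
    return (vendor in selector.get('vendors', []) and
            device in selector.get('devices', []))


selector = {"vendors": ["8086"], "devices": ["1520"], "pfNames": ["ens6"]}
print(matches_selector('0x8086\n', '0x1520\n', selector))
```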
 
 #. Enable Kubelet Pod Resources feature
 
@@ -169,6 +275,19 @@ a SR-IOV port on a baremetal installation the 3 following steps should be done:
    update ports with binding:profile information. Due to this it is necessary
    to make actions with privileged user with admin rights.
 
+#. Use vfio devices in containers
+
+   To use vfio devices inside containers it is necessary to load the vfio-pci
+   module. Remember that if our worker node is a virtual machine then it
+   should be loaded without iommu support:
+
+   .. code-block:: bash
+
+      rmmod vfio_pci
+      rmmod vfio_iommu_type1
+      rmmod vfio
+      modprobe vfio enable_unsafe_noiommu_mode=1
+      modprobe vfio-pci
 
 .. _NPWG spec: https://docs.openstack.org/kuryr-kubernetes/latest/specs/rocky/npwg_spec_support.html
 .. _sriov-device-plugin: https://docs.google.com/document/d/1D3dJeUUmta3sMzqw8JtWFoG2rvcJiWitVro9bsfUTEw
@@ -136,10 +136,11 @@ class VIFSriovDriver(health.HealthHandler, b_base.BaseBindingDriver):
         if driver in constants.USERSPACE_DRIVERS:
             LOG.info("PCI device %s will be rebinded to userspace network "
                      "driver %s", pci, driver)
-            self._set_vf_mac(pf, vf_index, vif.address)
-            if vif.network.should_provide_vlan:
-                vlan_id = vif.network.vlan
-                self._set_vf_vlan(pf, vf_index, vlan_id)
+            if vf_index and pf:
+                self._set_vf_mac(pf, vf_index, vif.address)
+                if vif.network.should_provide_vlan:
+                    vlan_id = vif.network.vlan
+                    self._set_vf_vlan(pf, vf_index, vlan_id)
             old_driver = self._bind_device(pci, driver)
             self._annotate_device(pod_link, pci, old_driver, driver, port_id)
         else:
@@ -152,9 +153,10 @@ class VIFSriovDriver(health.HealthHandler, b_base.BaseBindingDriver):
 
     def _move_to_netns(self, pci, ifname, netns, vif, vf_name, vf_index, pf,
                        pci_info):
-        if vif.network.should_provide_vlan:
-            vlan_id = vif.network.vlan
-            self._set_vf_vlan(pf, vf_index, vlan_id)
+        if vf_index and pf:
+            if vif.network.should_provide_vlan:
+                vlan_id = vif.network.vlan
+                self._set_vf_vlan(pf, vf_index, vlan_id)
 
         self._set_vf_mac(pf, vf_index, vif.address)
@@ -180,6 +182,13 @@ class VIFSriovDriver(health.HealthHandler, b_base.BaseBindingDriver):
         vf_name = vf_names[0]
 
         pfysfn_path = '/sys/bus/pci/devices/{}/physfn/net/'.format(pci)
+        # If physical function is not specified in VF's directory then
+        # this VF belongs to current VM node
+        if not os.path.exists(pfysfn_path):
+            LOG.info("Current device %s is a virtual function which is "
+                     "passed into VM. Getting its pci info", vf_name)
+            pci_info = self._get_vf_pci_info(pci, vf_name)
+            return vf_name, None, None, pci_info
         pf_names = os.listdir(pfysfn_path)
         pf_name = pf_names[0]
 
@@ -194,6 +203,25 @@ class VIFSriovDriver(health.HealthHandler, b_base.BaseBindingDriver):
             return vf_name, vf_index, pf_name, pci_info
         return None, None, None, None
 
+    def _get_vf_pci_info(self, pci, vf_name):
+        vendor_path = '/sys/bus/pci/devices/{}/vendor'.format(pci)
+        with open(vendor_path) as vendor_file:
+            # vendor_full contains a hex value (e.g. 0x8086)
+            vendor_full = vendor_file.read()
+            vendor = vendor_full.split('x')[1].strip()
+
+        device_path = '/sys/bus/pci/devices/{}/device'.format(pci)
+        LOG.info("Full path to device which is being processed: %s",
+                 device_path)
+        with open(device_path) as device_file:
+            # device_full contains a hex value (e.g. 0x1520)
+            device_full = device_file.read()
+            device = device_full.split('x')[1].strip()
+        pci_vendor_info = '{}:{}'.format(vendor, device)
+
+        return {'pci_slot': pci,
+                'pci_vendor_info': pci_vendor_info}
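The new `_get_vf_pci_info` derives `pci_vendor_info` by splitting the sysfs hex values on `'x'`. The same parsing, extracted into standalone functions purely for illustration:

```python
def parse_sysfs_hex(raw):
    """Turn a raw sysfs id value such as '0x8086' (with newline) into '8086'.

    Mirrors the split('x')[1].strip() parsing used in _get_vf_pci_info above.
    """
    return raw.split('x')[1].strip()


def pci_vendor_info(vendor_raw, device_raw):
    """Combine vendor and device sysfs values into the 'vvvv:dddd' form."""
    return '{}:{}'.format(parse_sysfs_hex(vendor_raw),
                          parse_sysfs_hex(device_raw))


print(pci_vendor_info('0x8086\n', '0x1520\n'))  # 8086:1520
```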
 
     def _bind_device(self, pci, driver, old_driver=None):
         if not old_driver:
             old_driver_path = '/sys/bus/pci/devices/{}/driver'.format(pci)