Wait before deleting trunk bridges for DPDK vhu

In DPDK vhostuser mode (DPDK/vhu), powering off an instance deletes its
port and powering it on creates a new one, so a reboot is functionally a
very fast delete-then-create.  Neutron trunking in combination with
DPDK/vhu implements a trunk bridge for each tenant, and the ports for
the instances are created as subports of that bridge.  Normally, when
all the subports of a trunk bridge are deleted, a thread is spawned to
delete the bridge itself, because that is an expensive and
time-consuming operation.  So if the port in question is the only port
on the trunk on that compute node, a reboot does this:

1. The port is deleted
2. A thread is spawned to delete the trunk
3. The port is recreated

If the trunk is deleted after #3 completes, the instance has no
networking and is inaccessible; that scenario was dealt with in a
previous change [1].  However, errors of the form "RowNotFound: Cannot
find Bridge with name=tbr-XXXXXXXX-X" continue to occur.  In this case
the trunk is deleted while #3 is still executing: the bridge stops
existing partway through the port creation logic, before the port has
actually been recreated.

Since this is a timing issue between two different threads, it is
difficult to stamp out entirely, but I think the best way to mitigate it
is to add a short delay of a few seconds (WAIT_BEFORE_TRUNK_DELETE) in
the trunk deletion thread before it checks the bridge.  That gives the
port time to come back online, so the existing check for attached
instance ports finds it and the trunk deletion is skipped entirely.
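
The wait-then-recheck idea can be sketched in isolation as follows.
This is only a toy model under stated assumptions, not the agent code:
FakeTrunkBridge, delete_if_unused and simulate are hypothetical names
invented for the illustration, plain threads stand in for the agent's
threading, and the delay is shortened so the sketch runs quickly.

```python
import threading
import time

# Assumption: shortened stand-in for the delay used by the change,
# so that the sketch finishes quickly.
WAIT_BEFORE_TRUNK_DELETE = 0.2


class FakeTrunkBridge:
    """Toy in-memory bridge; hypothetical, not the ovs_lib.OVSBridge API."""

    def __init__(self, name):
        self.name = name
        self.exists = True
        self.ports = set()
        self._lock = threading.Lock()

    def add_port(self, port):
        with self._lock:
            if not self.exists:
                raise LookupError(
                    "Cannot find Bridge with name=%s" % self.name)
            self.ports.add(port)

    def delete_port(self, port):
        with self._lock:
            self.ports.discard(port)

    def delete_if_unused(self):
        """Delete the bridge unless a port is attached."""
        with self._lock:
            if self.ports:
                return False
            self.exists = False
            return True


def handle_trunk_remove(bridge, delay=WAIT_BEFORE_TRUNK_DELETE):
    """Deletion thread body: wait first, then re-check for ports."""
    time.sleep(delay)
    bridge.delete_if_unused()


def simulate(power_back_on):
    bridge = FakeTrunkBridge("tbr-example")
    bridge.add_port("vhu-port")
    bridge.delete_port("vhu-port")       # 1. the port is deleted
    remover = threading.Thread(target=handle_trunk_remove, args=(bridge,))
    remover.start()                      # 2. the deletion thread is spawned
    if power_back_on:
        bridge.add_port("vhu-port")      # 3. the port is recreated
    remover.join()
    return bridge
```

In this model simulate(power_back_on=True) leaves the bridge in place
with its port attached, while simulate(power_back_on=False) still
removes the unused bridge.  With the delay set to zero the outcome of
the reboot case would depend on thread scheduling, which is the race
described above.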

[1] https://review.opendev.org/623275

Related-Bug: #1869244
Change-Id: I36a98fe5da85da1f3a0315dd1a470f062de6f38b
(cherry picked from commit e37722c0f5)
Nate Johnston 2020-03-24 18:05:16 -04:00 committed by Nate Johnston
parent 9ab3d21789
commit c0b15a8cbe
1 changed files with 3 additions and 0 deletions


@@ -14,6 +14,7 @@
 # under the License.
 
 import functools
+import time
 
 import eventlet
 from neutron_lib.callbacks import events
@@ -42,6 +43,7 @@ from neutron.services.trunk.rpc import agent
 LOG = logging.getLogger(__name__)
 
 DEFAULT_WAIT_FOR_PORT_TIMEOUT = 60
+WAIT_BEFORE_TRUNK_DELETE = 3
 
 
 def lock_on_bridge_name(required_parameter):
@@ -215,6 +217,7 @@ class OVSDBHandler(object):
         # try to mitigate the issue by checking if there is a port on the
         # bridge and if so then do not remove it.
         bridge = ovs_lib.OVSBridge(bridge_name)
+        time.sleep(WAIT_BEFORE_TRUNK_DELETE)
         if bridge_has_instance_port(bridge):
             LOG.debug("The bridge %s has instances attached so it will not "
                       "be deleted.", bridge_name)