Nova's use of libvirt's compareCPU() API served its purpose
over the years, but its design limitations break live migration in
subtle ways. For example, the compareCPU() API compares against the
host physical CPUID. Some of the features from this CPUID are not
exposed by KVM, and then there are some features that KVM emulates that
are not in the host CPUID. The latter can cause bogus live migration
failures.
With QEMU >= 2.9 and libvirt >= 4.4.0, libvirt will do the right thing in
terms of CPU compatibility checks on the destination host during live
migration. Nova satisfies these minimum version requirements by a good
margin. So, provide a workaround to skip the CPU comparison check on
the destination host before migrating a guest, and let libvirt handle it
correctly. This workaround will be removed once Nova replaces the older
libvirt APIs with their newer and improved counterparts[1][2].
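The workaround described above can be sketched roughly as follows. This
is a minimal illustration, not Nova's actual code: the option name
skip_cpu_compare_on_dest, the Workarounds class, and the
check_destination_cpu() helper are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Workarounds:
    # Hypothetical stand-in for a [workarounds] config group.
    skip_cpu_compare_on_dest: bool = False


def check_destination_cpu(workarounds, compare_cpu, guest_cpu):
    """Run the legacy CPU comparison unless the workaround disables it."""
    if workarounds.skip_cpu_compare_on_dest:
        # Leave compatibility checking to libvirt during the migration.
        return "skipped"
    compare_cpu(guest_cpu)
    return "compared"
```

With the flag left at its default the comparison still runs, so the
behaviour only changes for operators who opt in.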
- - -
Note that Nova's libvirt driver calls compareCPU() in another method,
_check_cpu_compatibility(); I have not removed that usage yet, as it
needs more careful combing of the code. Once that is done, we can:
- where possible, remove the usage of compareCPU() altogether, and
  rely on libvirt doing the right thing under the hood; or
- where Nova _must_ do the CPU comparison checks, switch to the better
  libvirt CPU APIs -- baselineHypervisorCPU() and
  compareHypervisorCPU() -- that are described here[1]. This is work
  in progress[2].
[1] https://opendev.org/openstack/nova-specs/commit/70811da221035044e27
[2] https://review.opendev.org/q/topic:bp%252Fcpu-selection-with-hypervisor-consideration
Change-Id: I444991584118a969e9ea04d352821b07ec0ba88d
Closes-Bug: #1913716
Signed-off-by: Kashyap Chamarthy <kchamart@redhat.com>
Signed-off-by: Balazs Gibizer <bgibizer@redhat.com>
Just because we encountered a PortNotFound error when unbinding a port
doesn't mean we should stop unbinding the remaining ports. If this error
is encountered, simply continue with the other ports.
While we're here, we clean up some other tests related to '_unbind_port'
since they're clearly duplicates.
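The intended loop behaviour can be sketched like this. It is a minimal
sketch, assuming a hypothetical unbind_one() callable and a local
PortNotFound class standing in for the real exception; it is not the
actual '_unbind_port' code.

```python
class PortNotFound(Exception):
    """Stand-in for the 'port not found' error raised while unbinding."""


def unbind_ports(port_ids, unbind_one):
    """Unbind every port, skipping the ones that no longer exist."""
    unbound, missing = [], []
    for port_id in port_ids:
        try:
            unbind_one(port_id)
        except PortNotFound:
            # The port is already gone; keep going with the other ports.
            missing.append(port_id)
            continue
        unbound.append(port_id)
    return unbound, missing
```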
Change-Id: Id04e0df12829df4d8929e03a8b76b5cbe0549059
Signed-off-by: Stephen Finucane <sfinucan@redhat.com>
Closes-Bug: #1974173
This change simply replaces calling notifyAll() with
notify_all() on the threading condition object in
the notification fixture.
notify_all() was introduced in Python 2.6.
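For illustration, here is a small standalone example of notify_all() on
a threading.Condition. The names cond, received, and waiter are made up
for this sketch and are not taken from the notification fixture. The
camelCase alias notifyAll() is deprecated as of Python 3.10.

```python
import threading

cond = threading.Condition()
received = []


def waiter():
    with cond:
        # wait_for() re-checks the predicate, so a notification sent
        # before this thread starts waiting is not lost.
        cond.wait_for(lambda: bool(received), timeout=5)


threads = [threading.Thread(target=waiter) for _ in range(3)]
for thread in threads:
    thread.start()

with cond:
    received.append("notification")
    cond.notify_all()  # wake every waiting thread, not just one

for thread in threads:
    thread.join(timeout=5)
```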
Change-Id: If8d386f20693016dd35ecfdbc703bf31aa103a67
At the moment, if libvirt times out in detaching a device, it
reports this as an ERROR even if the process will be retried
and eventually succeed.
We should just log a warning since there's nothing to do at that
point. If the process still fails after all the retries, it will log
an ERROR anyway.
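The logging policy described above might look like the following. This
is only a sketch under assumptions: DeviceDetachTimeout and
detach_with_retry() are hypothetical names, and the real retry logic in
the libvirt driver is more involved.

```python
import logging

LOG = logging.getLogger(__name__)


class DeviceDetachTimeout(Exception):
    """Hypothetical stand-in for a libvirt detach timeout."""


def detach_with_retry(detach, max_attempts=3):
    """Call detach() until it succeeds; return the attempt that worked."""
    for attempt in range(1, max_attempts + 1):
        try:
            detach()
            return attempt
        except DeviceDetachTimeout:
            # A timeout here is not final: the next attempt may succeed.
            LOG.warning("Detach attempt %d of %d timed out",
                        attempt, max_attempts)
    # Only an exhausted retry budget is reported as an error.
    LOG.error("Device detach failed after %d attempts", max_attempts)
    raise DeviceDetachTimeout("detach failed after all retries")
```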
Closes-Bug: #1972023
Change-Id: Idda12db5758706a97b7841571b9ecd3dc6e6905e
These are currently non-voting since we don't care about this stuff for
Zed. It does get us ready for a 3.10-having future, however.
Change-Id: I7740dafd6523eca27fa4e725d7eaf8558e434779
Signed-off-by: Stephen Finucane <sfinucan@redhat.com>
In the Zed cycle we stopped testing Python 3.6 and 3.7, which means
we dropped support for them, and we also bumped the minimum required
Python version in setup.cfg:
- Iba5074ea6f981a7527e86cfc98edd1ed7dd3086f
Adding a release note is a good way to communicate this.
Change-Id: I85c1136dd0cd0dce96f8285d3930a31c5a68ead5
The flag was added to prevent the source host neutron agents from
triggering a vif-plugged event too early during a live migration. But
it can actually be used in a more generic sense, as the code filters on
the migration_to binding profile attribute.
We saw vif-plugged events arriving too early from neutron during
evacuation in the post part of the nova-ovs-hybrid-plug job. So this
patch enables the workaround flag for this job too.
Closes-Bug: #1971563
Change-Id: Ifd20ece3a4f126da16f077247c2f1e072edb7163
As If9ab424cc7375a1f0d41b03f01c4a823216b3eb8 stated, there is a way for
the pci_device table to become inconsistent. The parent PF can be in
'available' state while its children VFs are still in 'unavailable'
state. In this situation the PF is schedulable but the PCI claim will
fail when trying to mark the dependent VFs unavailable.
This patch changes the PCI claim logic to allow claiming the parent PF
in the inconsistent situation as we assume that it is safe to do so.
This claim also fixes the inconsistency so that when the parent PF is
freed the children VFs become available again.
Closes-Bug: #1969496
Change-Id: I575ce06bcc913add7db0849f85728371da2032fc
When getting an instance using the compute.API we call
scatter_gather_single_cell() to be able to capture details when we fail
to retrieve a result from a cell such as timeouts and exceptions.
Currently, however, we aren't logging the content of an exception if
scatter_gather_single_cell() returns an exception as the result. The
scatter gather method itself logs exceptions that are not of type
NovaException, as these represent definite unexpected errors such as
database errors, but handling of NovaException is left to the caller,
which decides whether to log it, re-raise it, and so on.
It can be difficult to debug a situation where a cell is returning a
NovaException result so this adds logging of the exception content in
the compute API when we encounter an unexpected NovaException.
The existing log message has been updated to more accurately reflect
what has happened (did not respond vs exception). The assignment of the
exception object in scatter gather has also been updated to not
unnecessarily construct a new exception object because it (a) wasn't
necessary and (b) made asserting the LOG.exception() call argument in
the unit test difficult.
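The distinction between "did not respond" and "returned an exception"
could be handled along these lines. This is a simplified sketch: the
DID_NOT_RESPOND sentinel, the local NovaException class, and
describe_cell_result() are stand-ins invented for the example, not the
real compute API code.

```python
import logging

LOG = logging.getLogger(__name__)


class NovaException(Exception):
    """Stand-in for nova.exception.NovaException."""


# Hypothetical sentinel for a cell that timed out.
DID_NOT_RESPOND = object()


def describe_cell_result(cell_uuid, result):
    """Log according to what the cell actually returned."""
    if result is DID_NOT_RESPOND:
        LOG.warning("Cell %s did not respond", cell_uuid)
        return "no-response"
    if isinstance(result, NovaException):
        # Include the exception content so the failure can be debugged.
        LOG.error("Cell %s raised an exception", cell_uuid,
                  exc_info=result)
        return "exception"
    return "ok"
```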
Related-Bug: #1970087
Change-Id: Iae1c61c72be5b6017b934293e3dc079a24eeb0e7
The vmwareapi driver uses Managed-Object references (morefs) throughout
the code with the assumption that they are stable. A moref is, however,
a database id, which may change during the runtime of the compute node.
For example, if an instance is unregistered and re-registered in the
vCenter, the moref will change. By wrapping a moref in a proxy object,
with an additional method to resolve the openstack object to a moref,
we can hide those changes from a caller.
MoRef implementation with closure - should ease the transition to stable
mo-refs. One simply has to pass the search function as a closure to the
MoRef instance, and the very same method will be called when an
exception is raised for the stored reference.
Stable Volume refs - The connection_info['data'] contains the
managed-object reference (moref) as well as the uuid of the volume.
When the moref becomes invalid for some reason, we can recover it by
searching for the volume-uuid as the `config.instanceUuid` attribute
of the shadow-vm.
Stable VM Ref - By encapsulating all the parameters for searching for
the vm-ref again, we can move the retry logic to the session object,
where we can try to recover the vm-ref should it result in a
ManagedObjectNotFound exception.
Use refs as index for fakedb - It was previously using the object-id
to lookup an object, meaning that you couldn't pass a newly created
Managed-object-reference like you could over the vmware-api. Now the
lookup happens over the ref-id string, and in turn some functions
were refactored to take that into account.
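The proxy-with-closure idea can be illustrated with a toy example. All
names here (StableMoRef, call_with_stable_ref, the local
ManagedObjectNotFound class) are assumptions for the sketch and do not
mirror the driver's actual classes.

```python
class ManagedObjectNotFound(Exception):
    """Stand-in for the error raised when a moref is stale."""


class StableMoRef:
    """Proxy that can re-resolve its moref through a search closure."""

    def __init__(self, search):
        self._search = search
        self.value = search()

    def refresh(self):
        # Run the very same search again to pick up the new moref.
        self.value = self._search()
        return self.value


def call_with_stable_ref(ref, operation):
    """Run operation(moref), retrying once if the moref went stale."""
    try:
        return operation(ref.value)
    except ManagedObjectNotFound:
        return operation(ref.refresh())
```

Keeping the retry in one helper means callers never see the stale
reference, which is the point of hiding the change behind the proxy.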
Partial-Bug: #1962771
Change-Id: I2a3ddf95b7fe07630855b06e732f8764efb13e91
When the tox 'docs' target is called, it first installs the
dependencies (listed in 'deps') in the 'installdeps' phase, then it
installs nova (with its requirements) in the 'develop-inst' phase. In
the latter phase 'deps' is not used, so the constraints defined in
'deps' are not applied.
This could lead to failures on stable branches when new packages are
released that break the build. To avoid this, the simplest solution is
to pre-install requirements, i.e. add requirements.txt to 'docs' tox
target.
Change-Id: I4471d4488d336d5af0c23028724c4ce79d6a2031
As If9ab424cc7375a1f0d41b03f01c4a823216b3eb8 stated, there is a way for
the pci_device table to become inconsistent. The parent PF can be in
'available' state while its children VFs are still in 'unavailable'
state. In this situation the PF is schedulable but the PCI claim will
fail when trying to mark the dependent VFs unavailable.
This patch adds a test case that shows the error.
Related-Bug: #1969496
Change-Id: I7b432d7a32aeb1ab765d1f731691c7841a8f1440
We saw in the field that the pci_devices table can end up in an
inconsistent state after a compute node HW failure and re-deployment.
There could be dependent devices where the parent PF is in available
state while the children VFs are in unavailable state. (Before the HW
fault the PF was allocated, hence the VFs were marked unavailable.)
In this state the PF is still schedulable, but during the PCI claim
the handling of dependent devices in the PCI tracker will fail with
the error: "Attempt to consume PCI device XXX from empty pool".
The reason for the failure is that when the PF is claimed, all the
children VFs are marked unavailable. But if a VF is already
unavailable, that step fails.
One way the deployer might try to recover from this state is to remove
the VFs from the hypervisor and restart the compute agent. The compute
startup already has logic to delete PCI devices that are unused and
not reported by the hypervisor. However, this logic only removed
devices in 'available' state and ignored devices in 'unavailable' state.
If a device is unused and the hypervisor is no longer reporting it,
then it is safe to delete that device from the PCI tracker. So this
patch extends the logic to allow deleting 'unavailable' devices. There
is a small window when a dependent PCI device is in 'unclaimable'
state. From a cleanup perspective this is an analogous state, so it is
also added to the cleanup logic.
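The extended cleanup rule can be summarised in a few lines. This is a
sketch only: devices are modelled as plain dicts, and
devices_to_remove() and REMOVABLE_STATES are names invented for the
example rather than the PCI tracker's real API.

```python
# States in which a device is not in use by any instance.
REMOVABLE_STATES = {"available", "unavailable", "unclaimable"}


def devices_to_remove(tracked, reported_addresses):
    """Return tracked devices that are unused and no longer reported."""
    return [
        dev for dev in tracked
        if dev["address"] not in reported_addresses
        and dev["status"] in REMOVABLE_STATES
    ]
```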
Related-Bug: #1969496
Change-Id: If9ab424cc7375a1f0d41b03f01c4a823216b3eb8
During the testing of If9ab424cc7375a1f0d41b03f01c4a823216b3eb8 we
noticed that the unit test cases of PciTracker._set_hvdevs are changing
and leaking global state, leading to unstable tests.
To reproduce on master, duplicate the
test_set_hvdev_remove_tree_maintained_with_allocations test case and run
PciDevTrackerTestCase serially. The duplicated test case will fail with
File "/nova/nova/objects/pci_device.py", line 238, in _from_db_object
setattr(pci_device, key, db_dev[key])
KeyError: 'id'
This is caused by the fact that the test data is defined on module
level, both _create_tracker and _set_hvdevs modify the devices
passed to them, and some tests mix in db dicts when calling
_set_hvdevs, which expects pci dicts from the hypervisor.
This patch fixes multiple related issues:
* always deepcopy what _create_tracker takes as that list is later
  returned to the PciTracker via a mock and the tracker might modify
  what it got
* ensure that _create_tracker takes db dicts (with id field) while
  _set_hvdevs takes pci dicts in the hypervisor format (without id
  field)
* always deepcopy what is passed to _set_hvdevs as the PciTracker
  modifies what it gets
* normalize when the deepcopy happens to give a safe pattern for future
  test cases
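The deepcopy pattern is easy to show in isolation. This is a minimal
sketch with made-up names (FAKE_DB_DEVS, fresh_db_devs); the real test
module has its own fixtures and helpers.

```python
import copy

# Module-level test data shared by every test case.
FAKE_DB_DEVS = [{"id": 1, "address": "0000:00:00.1", "status": "available"}]


def fresh_db_devs(devs=None):
    """Hand out a private copy so callers cannot mutate the shared data."""
    return copy.deepcopy(FAKE_DB_DEVS if devs is None else devs)


devs = fresh_db_devs()
# Simulate code under test that modifies the dicts it was given.
del devs[0]["id"]
```

After this runs, FAKE_DB_DEVS still carries its 'id' key, so a later
test that builds objects from it does not hit the KeyError shown above.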
Change-Id: I20fb4ea96d5dfabfc4be3b5ecec0e4e6c5b3a318
Previously, the libvirt driver's live migration rollback code would
unconditionally refer to migrate_data.vifs. This field would only be
set if the Neutron multiple port bindings extension was in use. When
it is not in use, the reference would fail with a NotImplementedError.
This patch wraps the migrate_data.vifs reference in a conditional that
checks if the vifs field is actually set. This is the only way to do
it, as in the libvirt driver we do not have access to the network
API's has_port_binding_extension() helper.
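The guard can be sketched with a toy object that mimics the failure
mode. LazyFields is a simplified stand-in written for this example (it
is not the real migrate data class), and vifs_to_roll_back() is a
hypothetical helper.

```python
class LazyFields:
    """Toy object whose unset fields raise NotImplementedError on access."""

    def __init__(self, **values):
        self._values = dict(values)

    def __contains__(self, name):
        # True only when the field has actually been set.
        return name in self._values

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._values[name]
        except KeyError:
            raise NotImplementedError(
                f"Cannot load '{name}' in the base class") from None


def vifs_to_roll_back(migrate_data):
    """Return the VIFs to roll back, or nothing if the field is unset."""
    if "vifs" in migrate_data:
        return list(migrate_data.vifs)
    return []
```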
Closes-bug: 1969980
Change-Id: I48ca6a77de38e3afaa44630e6ae1fd41d2031ba9
When the libvirt driver does live migration rollback of an instance
with network interfaces, it unconditionally refers to
migrate_data.vifs. These will only be set when Neutron has the
multiple port bindings extension. We don't handle the case of the
extension not being present, and currently the rollback will fail with
a "NotImplementedError: Cannot load 'vifs' in the base class" error.
Related-bug: 1969980
Change-Id: Ieef773453ed9f3ced564c1a352fbefbcc6a653ec
Resolves a bug encountered when setting the Nova scheduler to
be aware of Neutron routed provider network segments, by using
'query_placement_for_routed_network_aggregates'.
Non-admin users attempting to access the 'segment_id' attribute
of a subnet caused a traceback, resulting in instance creation
failure.
This patch ensures the Neutron client is initialised with an
administrative context no matter what the requesting user's
permissions are.
Change-Id: Ic0f25e4d2395560fc2b68f3b469e266ac59abaa2
Closes-Bug: #1970383
... because the functionality of this parameter effectively duplicates
the HTTPProxyToWSGI middleware in the oslo.middleware library.
Closes-Bug: #1967686
Change-Id: Ifebcfb6b5c1594c075bb9c152a06aa7af7c61bc8
The VMwareAPISession object is not only used by the driver, but in
practically all modules of vmwareapi. This change also slightly
reduces the scope of the driver module itself.
Partial-Bug: #1962771
Change-Id: I4094b6031872bd3b5c871b9a82c7e01280a3352d