This reverts commit 5a10047f9d.
This gets us back to Ib0cf5d55750f13d0499a570f14024dca551ed4d4
which was meant to address an issue introduced
by Id188d48609f3d22d14e16c7f6114291d547a8986.
So we essentially had three changes:
1. Hard reboot would blow away volumes and vifs and then wait for the
vifs to be plugged; this caused a problem for some vif types
(linuxbridge was reported) because the event never came and we
timed out.
2. To work around that, a second change was made to simply not wait for
vif plugging events.
3. Since #2 was a bit heavy-handed for a problem that didn't impact
openvswitch, another change was made to only wait for non-bridge vif
types, so we'd wait for OVS.
But it turns out that opendaylight is an OVS vif type and doesn't send
events for plugging the vif, only for binding the port (and we don't
re-bind the port during reboot). There is also a report of this being a
problem for other types of ports, see
If209f77cff2de00f694b01b2507c633ec3882c82.
So rather than try to special-case every possible vif type that could
be impacted by this, we are simply reverting the change so we no longer
wait for vif plugged events during hard reboot.
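As a rough illustration, the restored behavior in the libvirt driver
looks like the following sketch; vifs_already_plugged is the real knob
here, but the surrounding code is heavily simplified and the helper
names are stand-ins, not the driver's actual signatures:

    def hard_reboot(destroy_guest, build_guest_xml,
                    create_domain_and_network):
        # Simplified sketch: tear everything down, then recreate the
        # domain without registering for network-vif-plugged events.
        destroy_guest()
        xml = build_guest_xml()
        # vifs_already_plugged=True means "do not wait for plug
        # events", which is what this revert restores for hard reboot.
        create_domain_and_network(xml, vifs_already_plugged=True)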
Note that if we went back to Id188d48609f3d22d14e16c7f6114291d547a8986
and tweaked that to not unplug/plug the vifs we wouldn't have this
problem either, and that change was really meant to deal with an
encrypted volume issue on reboot. But changing that logic is out of the
scope of this change. Alternatively, we could re-bind the port during
reboot but that could have other implications, or neutron could put
something into the port details telling us which vifs will send events
and which won't, but again that's all outside of the scope of this
patch.
Change-Id: Ib3f10706a7191c58909ec1938042ce338df4d499
Closes-Bug: #1755890
When spawning a Hyper-V instance with NICs having the vif_type "hyperv",
neutron will fail to bind the port to the Hyper-V host if the neutron
server doesn't have the "hyperv" mechanism driver installed and configured,
resulting in a PortBindingFailed exception on the nova-compute side.
When this exception is encountered, the logs say to check the
neutron-server logs, but the problem and its solution are not obvious
or clear, resulting in plenty of questions / reports that all have the
same solution: check that an L2 agent on the host is alive and
reporting to neutron, and, if the neutron Hyper-V agent is used, make
sure networking-hyperv is installed and neutron-server is configured
to use the "hyperv" mechanism_driver.
Change-Id: Idceeb08e1452413e3b10ecd0a65f71d4d82866e0
Closes-Bug: #1744032
(cherry picked from commit b80c245ba5)
When booting instances, nova might create neutron resources. For that,
the network service endpoint needs to be available. Otherwise we run
into:
EndpointNotFound: ['internal', 'public'] endpoint for network service \
not found
Change-Id: Iaed84826b76ab976ffdd1c93106b7bae700a64a9
Closes-Bug: #1752289
(cherry picked from commit 3a3b0f09db)
With the addition of multiattach we need to ensure that we
don't make brick calls to remove connections on volume detach
if that volume is attached to another instance on the same
node.
This patch adds a new helper method (_should_disconnect_target)
to the virt driver that will inform the caller if the specified
volume is attached multiple times to the current host.
The general strategy for this call is to fetch a current reference
of the specified volume and then:
1. Check if that volume has >1 active attachments
2. Fetch the attachments for the volume and extract the server_uuids
for each of the attachments.
3. Check the server_uuids against a list of all known server_uuids
on the current host. Increment a connection_count for each item
found.
If the connection_count is >1 we return `False` indicating that the
volume is being used by more than one attachment on the host and
we therefore should NOT destroy the connection.
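A standalone sketch of that counting logic, with volume_api and
host_instance_uuids standing in for the driver's real collaborators
(the actual helper lives on the virt driver and its signature may
differ):

    def should_disconnect_target(volume_api, context, volume_id,
                                 host_instance_uuids):
        """Return True when it is safe to disconnect the target."""
        volume = volume_api.get(context, volume_id)
        # Step 1: a single attachment can never be shared on this host.
        if len(volume.get('attachments', [])) < 2:
            return True
        # Step 2: extract the server uuid for each attachment.
        attachments = volume_api.attachment_get_all(context, volume_id)
        server_uuids = [a['server_id'] for a in attachments]
        # Step 3: count how many of those servers live on this host.
        connection_count = sum(
            1 for uuid in server_uuids if uuid in host_instance_uuids)
        # >1 local attachment: the target is still in use, so do NOT
        # destroy the connection.
        return connection_count < 2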
*NOTE*
This scenario is very different than the `shared_targets`
case (for which we supply a property on the Volume object). The
`shared_targets` scenario is specifically for Volume backends that
present >1 Volumes using a single Target. This mechanism is meant
to provide a signal to consumers that locking is required for the
creation and deletion of initiator/target sessions.
Closes-Bug: #1752115
Change-Id: Idc5cecffa9129d600c36e332c97f01f1e5ff1f9f
(cherry picked from commit 139426d514)
We need this in a later change to pull volume attachment
information from cinder for the volume being detached so
that we can do some attachment counting for multiattach
volumes being detached from instances on the same host.
Change-Id: I751fcb7532679905c4279744919c6cce84a11eb4
Related-Bug: #1752115
(cherry picked from commit d2941bfd16)
If the boot-volume creation fails, the data volume is left in state
"in-use", attached to the server which is now in "error" state.
The user can't detach the volume because of the server's error state.
They can delete the server, which then leaves the volume apparently
attached to a server that no longer exists, which is being fixed
separately: https://review.openstack.org/#/c/340614/
The only way out of this is to ask an administrator to reset the state of
the data volume (this option is not available to regular users by
default policy).
This change fixes the problem in the compute service such that
when the creation fails, the compute manager detaches the created
volumes before putting the VM into error state. You can then delete
the instance without worrying about attached volumes.
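A condensed, standalone sketch of that ordering (the real code lives
in the compute manager's build path and uses its own helpers):

    def build_with_volume_cleanup(prep_block_device,
                                  detach_created_volumes,
                                  set_error_state):
        # Sketch: if creating/attaching volumes fails partway, detach
        # whatever was created before the VM goes to error state.
        try:
            prep_block_device()
        except Exception:
            detach_created_volumes()
            set_error_state()
            raise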
Change-Id: I8b1c05317734e14ea73dc868941351bb31210bf0
Closes-bug: #1633249
(cherry picked from commit 61f6751a18)
If spawning fails when unshelving, terminate the volumes' connections
with the node, and remove the node reference from the instance entry.
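Roughly, the added failure handling looks like this sketch (helper
names assumed, not the manager's exact API):

    def unshelve_spawn(spawn, terminate_volume_connections, instance):
        # Sketch: on spawn failure, drop volume connections and clear
        # the node reference so a later unshelve can pick a new host.
        try:
            spawn()
        except Exception:
            terminate_volume_connections()
            instance.node = None
            instance.save()
            raise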
Co-Authored-By: Matt Riedemann <mriedem.os@gmail.com>
Closes-Bug: 1627694
Change-Id: I8cfb2280d956d452ccad1fc711bd814b7258147f
(cherry picked from commit dcdd2c9832)
During cold resize, the ComputeManager's prep_resize calls the
rpcapi.ComputeAPI's resize_instance method, which will then do an
RPC cast (async).
Because the RPC cast is asynchronous, the exception branch in
prep_resize will not be executed if the cold resize fails, the
allocations will not be cleaned up, and the instance will not be
rescheduled.
This patch adds allocation cleanup in the resize_instance and finish_resize
methods.
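In sketch form (the cleanup helper is an assumption; the real change
wires this into ComputeManager.resize_instance and finish_resize):

    def resize_with_cleanup(do_resize, cleanup_allocations):
        # Sketch: prep_resize's except branch never runs after the
        # async cast, so the cast's target must clean up its own
        # allocations on failure.
        try:
            do_resize()
        except Exception:
            cleanup_allocations()
            raise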
Change-Id: I2d9ab06b485f76550dbbff46f79f40ff4c97d12f
Closes-Bug: #1749215
(cherry picked from commit caf167862d)
noVNC 1.0.0 has the fix for non-US key mappings, so this adds a simple
note when installing the novnc package that at least 1.0.0 should be
used for non-US key map support.
Change-Id: Ia1a84c986025f8a46c1062440faa0deb1d2d73a5
Related-Bug: #1682020
(cherry picked from commit ed7af4c8f4)
Add functional tests for traits API in the following cases.
* Invalid 'resource_provider_generation' in
PUT /resource_providers/{uuid}/traits
* Invalid 'traits' in
PUT /resource_providers/{uuid}/traits
* Additional properties in
PUT /resource_providers/{uuid}/traits
* Earlier microversion (1.5)
Fix a response string to check in the following test.
* Missing 'resource_provider_generation' in
PUT /resource_providers/{uuid}/traits
Change-Id: I4db0c8a5c55f7fcdebd5fcb04273d922727a4521
(cherry picked from commit c23f135b11)
Add the 'X-Openstack-Request-Id' header
to PUT requests in SchedulerReportClient.
When removing a resource provider from an instance allocation
and putting allocations to a resource provider,
the header is added.
Subsequent patches will add the header in the other cases.
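A minimal sketch of passing the header through; the keystoneauth1
Adapter accepts a headers kwarg, but the exact plumbing inside
SchedulerReportClient is simplified here:

    def put(client, url, data, version=None, global_request_id=None):
        # Sketch: attach the caller's global request id so placement
        # logs can be correlated with the originating API request.
        headers = ({'X-Openstack-Request-Id': global_request_id}
                   if global_request_id else {})
        return client.put(url, json=data, microversion=version,
                          headers=headers)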
Conflicts:
nova/scheduler/client/report.py
NOTE(takashin): The conflict is due to
I2563213dc3b184049af336f0807c4cb4690a9f9a not being backported.
Change-Id: I7891b98f225f97ad47f189afb9110ef31c810717
Partial-Bug: #1734625
(cherry picked from commit f66bd7369a)
Under certain failure scenarios it may be that, although the libvirt
definition for the volume has been removed for the instance, the
associated storage lun on the compute server has not been fully
cleaned up yet.
If a user makes another attempt to detach the volume, we should not
stop the process when the device is not found in the domain
definition, but instead try to disconnect the logical device from the
host.
This commit makes the process attempt to disconnect the volume even
if the device is not attached to the guest.
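Sketched standalone, with a local exception type standing in for the
driver's "device not found" error:

    class DeviceNotFound(Exception):
        """Stand-in for the driver's 'not attached to guest' error."""

    def detach_volume(detach_from_guest, disconnect_from_host):
        # Sketch: tolerate a device that is already gone from the
        # domain definition, but still clean up the host-side device.
        try:
            detach_from_guest()
        except DeviceNotFound:
            pass
        disconnect_from_host()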
Closes-Bug: #1727260
Change-Id: I4182642aab3fd2ffb1c97d2de9bdca58982289d8
Signed-off-by: Sahid Orentino Ferdjaoui <sahid.ferdjaoui@redhat.com>
(cherry picked from commit ce531dd1b7)
If an instance is deleted before it is scheduled, the BDM
clean-up code uses the mappings from the build request as
they don't exist in the database yet.
When using the older attachment flow with reserve_volume,
there is no attachment_id bound to the block device mapping,
and because it is not loaded from the database but rather from
the build request, accessing the attachment_id field raises
an exception with 'attachment_id not lazy-loadable'.
If we did a new-style attach, _validate_bdm will add the
attachment_id from Cinder. If we did not, then this patch
makes sure to set it to None to avoid raising an
exception when checking if we have an attachment_id set in
the BDM clean-up code.
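The guard amounts to something like this sketch (Nova objects support
the 'in' check for set fields; the surrounding code is omitted):

    # Sketch of the default applied during BDM clean-up:
    if 'attachment_id' not in bdm:
        # The BDM came from the build request, not the database, so
        # the field cannot be lazy-loaded; default it instead of
        # raising.
        bdm.attachment_id = None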
Change-Id: I3cc775fc7dafe691b97a15e50ae2e93c92f355be
Closes-Bug: #1750666
(cherry picked from commit 16c2c8b3ee)
When creating a new instance and deleting it before it gets scheduled
with the old attachment flow (reserve_volume), the block device mappings
are not persisted to the database, which means that the clean up fails
because it tries to look up attachment_id, which cannot be lazy-loaded.
This patch adds a (failing) functional test to check for this issue
which will be addressed in a follow-up patch.
Related-Bug: #1750666
Change-Id: I294c54e5a22dd6e5b226a4b00e7cd116813f0704
(cherry picked from commit 3120627d98)
As privsep uses msgpack to send method arguments to the privsep
daemon, we can no longer use custom data types like
nova.objects.instance.Instance.
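Illustrative only: the pattern is to pass msgpack-serializable
primitives across the privsep boundary; the decorator below follows
nova's privsep setup but the function itself is hypothetical:

    import nova.privsep

    @nova.privsep.sys_admin_pctxt.entrypoint
    def fix_something(instance_uuid, instance_path):
        # Only primitives (strings here) survive the msgpack
        # round-trip to the daemon; a nova.objects.Instance would not.
        ...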
Change-Id: I09f04d5b2f1cb39339ad7c4569186db5d361797a
Closes-Bug: #1742963
(cherry picked from commit 1f5fe3190b)
It was discussed and decided [1] that providers associated via
aggregate with the compute node's provider tree should only be pulled
down, cached, and passed to update_provider_tree if they are sharing
providers. Otherwise we'll get e.g. all the *other* compute nodes
which are also associated with a sharing provider.
[1] https://review.openstack.org/#/c/540111/4/specs/rocky/approved/update-provider-tree.rst@48
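In sketch form, the filter keeps only the providers that carry the
sharing trait (the trait name is the one placement uses; the data
shapes are simplified):

    def filter_sharing_providers(providers_in_aggregates):
        # Sketch: drop e.g. other compute nodes that merely share an
        # aggregate; keep actual sharing providers.
        return [p for p in providers_in_aggregates
                if 'MISC_SHARES_VIA_AGGREGATE' in p.get('traits', ())]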
Closes-Bug: #1750084
(cherry picked from commit d2152f3094)
Conflicts:
nova/scheduler/client/report.py
nova/tests/functional/api/openstack/placement/test_report_client.py
nova/tests/unit/scheduler/client/test_report.py
NOTE(efried): Conflicts due to changes in this series:
https://review.openstack.org/#/q/project:openstack/nova+topic:bug/1734625+branch:master
Change-Id: Iab366da7623e5e31b8416e89fee7d418f7bf9b30
Because of the big Exception block in _validate_bdm, the
multiattach-specific errors raised out of the
_check_attach_and_reserve_volume method were being lost
and the very generic InvalidBDMVolume was returned to the
user.
For example, I hit this when trying to create a server from
a multiattach volume but forgot to specify microversion 2.60
and it was just telling me it couldn't get the volume, which
I knew was bogus since I could get the volume details.
The fix is to handle the specific errors we want to re-raise.
The tests, which missed this because of their high-level mocking,
are updated so that we actually get to the problematic code and
only the things we don't care about along the way are mocked out.
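The shape of the fix is a narrowed except chain, sketched here with
exception names from nova.exception (treat the exact list as
approximate):

    from nova import exception

    def check_attach(check_attach_and_reserve_volume, volume_id):
        try:
            check_attach_and_reserve_volume()
        except (exception.MultiattachNotSupportedByVirtDriver,
                exception.MultiattachNotSupportedOldMicroversion):
            # Re-raise the specific, user-actionable errors...
            raise
        except Exception:
            # ...and keep the generic catch-all for everything else.
            raise exception.InvalidBDMVolume(id=volume_id)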
Change-Id: I0b397e5bcdfd635fa562beb29819dd8c6b828e8a
Closes-Bug: #1750064
(cherry picked from commit 5754ac0ab3)
If the user creates a volume-backed server from an existing
volume, the API reserves the volume by creating an attachment
against it. This puts the volume into 'attaching' status.
If the user then deletes the server before it's created in a
cell, by deleting the build request, the attached volume is
orphaned and requires admin intervention in the block storage
service.
This change simply pulls the BDMs off the BuildRequest when
we delete the server via the build request and does the same
local cleanup of those volumes as we would in a "normal" local
delete scenario that the instance was created in a cell but
doesn't have a host.
We don't have to worry about ports in this scenario since
ports are created on the compute, in a cell, and if we're
deleting a build request then we never got far enough to
create ports.
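Condensed, the delete-by-build-request path gains something like the
following; _local_cleanup_bdm_volumes matches the compute API's helper
name, but the call is simplified:

    # Sketch inside the build-request delete path:
    bdms = build_request.block_device_mappings
    if bdms:
        # Same volume cleanup as a "normal" local delete of an
        # instance that has no host yet.
        self._local_cleanup_bdm_volumes(bdms, instance, context)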
Change-Id: I1a576bdb16befabe06a9728d7adf008fc0667077
Partial-Bug: #1404867
(cherry picked from commit 0652e4ab3d)
This is another wrinkle for bug 1404867 where we create a
volume-backed server, create an attachment on the volume which
puts the volume in 'attaching' status, and then delete the server
before it's actually created in a cell.
In this case, the _delete_while_booting code in the compute API
finds and deletes the BuildRequest before the instance was ever
created in a cell.
The bug is that _delete_while_booting in the API doesn't also
process block device mappings and unreserve/delete attachments
on the volume, which orphans the volume and can only be fixed
with admin intervention in the block storage service.
Change-Id: Ib65acc671711eae7aee65df9cd5c6b2ccb559f5c
Related-Bug: #1404867
(cherry picked from commit 08f0f71a83)
Usually, when instance.host = None, it means the instance was never
scheduled. However, the exception handling routine in compute manager
[1] will set instance.host = None and set instance.vm_state = ERROR
if the instance fails to build on the compute host. If that happens, we
end up with an instance with host = None and vm_state = ERROR which may
have ports and volumes still allocated.
This adds some logic around deleting the instance when it may have
ports or volumes allocated.
1. If the instance is not in ERROR or SHELVED_OFFLOADED state, we
expect instance.host to be set to a compute host. So, if we find
instance.host = None in states other than ERROR or
SHELVED_OFFLOADED, we consider the instance to have failed
scheduling and not require ports or volumes to be freed, and we
simply destroy the instance database record and return. This is
the "delete while booting" scenario.
2. If the instance is in ERROR because of a failed build or is
SHELVED_OFFLOADED, we expect instance.host to be None even though
there could be ports or volumes allocated. In this case, run the
_local_delete routine to clean up ports and volumes and delete the
instance database record.
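Put together, the branching reads roughly like this sketch (vm_states
is nova's real state module; local_delete stands in for the existing
cleanup routine):

    from nova.compute import vm_states

    def delete_hostless(context, instance, local_delete):
        # Sketch: only reached when instance.host is None.
        if instance.vm_state not in (vm_states.ERROR,
                                     vm_states.SHELVED_OFFLOADED):
            # Case 1: failed scheduling; no ports/volumes to free.
            instance.destroy()
            return
        # Case 2: failed build or shelved offloaded; ports/volumes
        # may be allocated, so clean them up, then delete the record.
        local_delete(context, instance)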
Co-Authored-By: Ankit Agrawal <ankit11.agrawal@nttdata.com>
Co-Authored-By: Samuel Matzek <smatzek@us.ibm.com>
Co-Authored-By: melanie witt <melwittt@gmail.com>
Closes-Bug: 1404867
Closes-Bug: 1408527
Conflicts:
nova/tests/unit/compute/test_compute_api.py
[1] https://github.com/openstack/nova/blob/55ea961/nova/compute/manager.py#L1927-L1929
Change-Id: I4dc6c8bd3bb6c135f8a698af41f5d0e026c39117
(cherry picked from commit b3f39244a3)
In certain cases, such as when an instance fails to be scheduled,
the volume may already have an attachment created (or the volume
has been reserved in the old flows).
This patch adds a test to check that these volume attachments
are deleted and removed once the instance has been deleted. It
also adds some functionality to allow checking when a volume
has been reserved in the Cinder fixtures.
Change-Id: I85cc3998fbcde30eefa5429913ca287246d51255
Related-Bug: #1404867
(cherry picked from commit 20edeb3623)
If an instance fails to get scheduled, it gets buried in cell0 but
none of its block device mappings are stored. At the API layer,
Nova reserves and creates attachments for new instances when
it gets a create request so these attachments are orphaned if the
block device mappings are not registered in the database somewhere.
This patch makes sure that if an instance is being buried in cell0,
all of its block device mappings are recorded as well so they can
be later removed when the instance is deleted.
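A standalone sketch of the burying step (object plumbing and cell
targeting simplified; names assumed):

    def bury_in_cell0(instance, build_request):
        # Sketch: create the instance record in cell0 and record its
        # BDMs so deleting the instance can clean up attachments.
        instance.create()
        for bdm in build_request.block_device_mappings or []:
            bdm.instance_uuid = instance.uuid
            bdm.create()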
Change-Id: I64074923fb741fbf5459f66b8ab1a23c16f3303f
Related-Bug: #1404867
(cherry picked from commit ad9e2a568f)
There is an existing loop which sets the proper value for status
and attach_status right above, so this is doing nothing other than
changing it to the incorrect value.
Co-Authored-By: Mohammed Naser <mnaser@vexxhost.com>
Change-Id: Iea0c1ea0a699b9519f66977391202956f17aac66
(cherry picked from commit 8cd64670ea)
When creating a snapshot of a volume-backed instance, we
create a snapshot of every volume BDM associated with the
instance. The default volume snapshot limit is 10, so if
you have a volume-backed instance with several volumes attached
and snapshot it a few times, you're likely to fail the
volume snapshot at some point with an OverLimit error from
Cinder. This can lead to orphaned volume snapshots in Cinder
that the user then has to cleanup.
This change makes the snapshot operation a bit more robust by
first checking the quota limit and current usage for the given
project before attempting to create any volume snapshots.
It's not fail-safe since we could still fail with racing snapshot
requests for the same project, but it's a simple improvement to
avoid this issue in the general case.
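The pre-flight check amounts to something like this sketch;
maxTotalSnapshots and totalSnapshotsUsed are Cinder's absolute-limit
keys, while the helper and exception names are assumptions:

    class SnapshotOverQuota(Exception):
        """Stand-in for the quota error surfaced to the caller."""

    def check_snapshot_quota(volume_api, context, needed):
        # Sketch: compare current usage plus the snapshots we are
        # about to create against the project's limit up front.
        limits = volume_api.get_absolute_limits(context)
        max_snaps = limits['maxTotalSnapshots']
        used = limits['totalSnapshotsUsed']
        # Cinder reports -1 for unlimited.
        if 0 <= max_snaps < used + needed:
            raise SnapshotOverQuota()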
Change-Id: I4e7b46deb43c0c2430b480f1a498a52fc4a9daf0
Related-Bug: #1731986
(cherry picked from commit 289d2703c7)
This will be used in a later patch to check quota usage
for volume snapshots before attempting to create new
volume snapshots, so we can avoid an OverLimit error.
Change-Id: Ica7c087708e86494d285fc3905a5740fd1356e5f
Related-Bug: #1731986
(cherry picked from commit ad389244ba)