Trigger instance recovery audit when host goes offline

When the VIM handles a force-lock operation, it tells nova to fail
the instances and calls reschedule_audit_instances to shorten the
instance recovery audit interval to 30s, so that the audit can
evacuate the instances once the host goes offline. However, if the
previous audit ran more than 30s earlier (the normal interval is
330s), the audit triggers right away. At that point nova has not yet
failed the instances, so the recovery audit (recover_instances) sees
no failed instances, does nothing, and schedules the next audit for
330s later. By then the host could be back online, and the instances
can no longer be evacuated (the host must be offline for an
evacuate).

The solution is to call recover_instances once the host goes
offline. This will have the effect of setting the audit interval
to 30s. When the audit runs the next time, it will see the
instances are failed and evacuate them.
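
The timing problem and the fix can be illustrated with a toy model.
This is only a sketch, not the VIM code: ToyInstanceDirector, simulate,
the boolean flags and the 120s "host back online" delay are all
hypothetical, and the ordering of events (host offline before nova
reports the instances as failed) is an assumption. Only the names
recover_instances and reschedule_audit_instances and the 30s/330s
intervals come from the description above.

```python
NORMAL_INTERVAL = 330  # seconds, normal audit interval
FAST_INTERVAL = 30     # seconds, interval used while recovery is pending


class ToyInstanceDirector:
    """Hypothetical stand-in for the instance director (illustration only)."""

    def __init__(self):
        self.host_is_offline = False
        self.instances_failed = False
        self.evacuated = False
        self.next_audit_in = NORMAL_INTERVAL

    def reschedule_audit_instances(self):
        # Ask for the next audit to run within the fast interval.
        self.next_audit_in = FAST_INTERVAL

    def recover_instances(self):
        # Evacuation is only possible for failed instances on an offline host.
        if self.host_is_offline and self.instances_failed:
            self.evacuated = True
        # Assumption: keep the fast interval while an offline host still
        # has instances that have not been recovered.
        pending = self.host_is_offline and not self.evacuated
        self.next_audit_in = FAST_INTERVAL if pending else NORMAL_INTERVAL


def simulate(call_recover_on_offline, host_back_online_after=120):
    """Return True if the instances end up evacuated."""
    director = ToyInstanceDirector()

    # Force-lock: the audit is rescheduled but fires immediately, before
    # nova has failed the instances, so it falls back to the 330s interval.
    director.reschedule_audit_instances()
    director.recover_instances()

    # The host goes offline; the fix calls recover_instances here.
    director.host_is_offline = True
    if call_recover_on_offline:
        director.recover_instances()

    # Nova now reports the instances as failed.
    director.instances_failed = True

    # If the next audit is too far away, the host is back online by then.
    if director.next_audit_in >= host_back_online_after:
        director.host_is_offline = False
    director.recover_instances()
    return director.evacuated


if __name__ == "__main__":
    print("without fix, evacuated:", simulate(call_recover_on_offline=False))
    print("with fix, evacuated:", simulate(call_recover_on_offline=True))
```

In this model the run without the extra call misses the evacuation
window because the next audit is 330s away, while the run with the
call drops back to the 30s interval and evacuates on the next audit.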

Story: 2002860
Task: 22809

Change-Id: I80473d6f41850f9cfc7be8125fe8fda4fdc5a56c
Signed-off-by: Don Penney <don.penney@windriver.com>
Bart Wensley 2018-06-18 07:25:18 -05:00 committed by Jack Ding
parent 1348e0ad59
commit d1215497a4

@@ -396,6 +396,9 @@ class HostDirector(object):
                       % host.name)
         instance_director = directors.get_instance_director()
         instance_director.host_offline(host)
+        # Now that the host is offline, we may be able to recover instances
+        # on that host (i.e. evacuate them).
+        instance_director.recover_instances()
 
     @staticmethod
     def host_audit(host):