If an AWS spot instance is used as a metastatic backing node, an
unexpected series of events can occur:
* aws driver creates backing node instance
* aws driver scans ssh keys and stores them on backing node
* aws reclaims spot instance
* aws re-uses IP from backing node
* metastatic driver creates node
* metastatic driver scans ssh keys and stores them on node
Zuul would then use the wrong node (whether that succeeds depends
on what else has happened to the node in the interim).
To avoid this situation, we implement this change:
* After scanning the metastatic node ssh keys, we compare them to
the backing node ssh keys and if they differ, trigger an error
in the metastatic node and mark the backing node as failed.
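A minimal sketch of that comparison, using illustrative node objects
and attribute names rather than nodepool's actual classes:

```python
# Hypothetical sketch; FakeNode and check_backing_node_keys are
# illustrative names, not part of the real nodepool API.
class FakeNode:
    def __init__(self, host_keys):
        self.host_keys = host_keys
        self.failed = False


def check_backing_node_keys(node, backing_node):
    """Fail the backing node if the SSH host keys just scanned from
    the metastatic node differ from the keys recorded when the
    backing node was originally scanned."""
    if set(node.host_keys) != set(backing_node.host_keys):
        # The backing instance was likely reclaimed and its IP
        # reused; mark it failed so it is not handed out again.
        backing_node.failed = True
        raise RuntimeError("SSH host key mismatch; "
                           "backing node was replaced")
```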
In case the instance is reclaimed one step earlier in the above
sequence (before the metastatic driver creates its node), we also
implement this change:
* After completing the nodescan, the aws driver will double check
that the instance is still running; if not, it will trigger an
error.
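The post-nodescan check could look roughly like this; the boto3
describe_instances call is real, but the error handling and function
name are a simplified assumption rather than the driver's actual code:

```python
def verify_instance_running(ec2_client, instance_id):
    """After the nodescan completes, confirm the EC2 instance still
    exists and is in the 'running' state; otherwise raise an error
    so the launch is treated as failed."""
    resp = ec2_client.describe_instances(InstanceIds=[instance_id])
    reservations = resp.get("Reservations", [])
    if not reservations:
        raise RuntimeError(f"Instance {instance_id} no longer exists")
    state = reservations[0]["Instances"][0]["State"]["Name"]
    if state != "running":
        raise RuntimeError(
            f"Instance {instance_id} is {state}, not running")
```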
The above is still subject to a small race if the nodescan takes
less time than the cache interval of the instance list, and the
node is reclaimed after the nodescan but within the cache interval
(currently 10 seconds). In the unlikely event that does happen,
then the metastatic key check should still catch the issue as long
as the replacement node also does not boot within those 10 seconds.
(Technically possible, but the combination of all of these things
should be very unlikely in practice.)
Change-Id: I9ce1f6df04e9c49deceda99c8e4024dd98ea88f9