The conductor migrate_server RPC method is a blocking RPC call
used by both the API during a resize / cold migrate request and
by the compute service if rescheduling from a failed prep_resize
operation on the selected dest host (or alternate).
Currently the RPC call uses the global rpc_response_timeout,
which defaults to 60 seconds. When the request comes from the API,
it goes from API to conductor to scheduler, and the response is not
returned to the API caller until conductor casts to the
first selected destination host's prep_resize method. In a large
deployment, or under heavy load on the control plane, this could
take long enough to trip the rpc_response_timeout and result in
a MessagingTimeout 500 error response to the user.
Reschedules from the compute should be faster because they don't
involve a round trip to the scheduler (alternate selections have
been available since Queens).
This makes the migrate_server method use the long_rpc_timeout
config option, which defaults to 1800 seconds, as the overall timeout.
The rpc_response_timeout becomes the heartbeat value to make sure
the call is still alive.
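
The timeout split described above can be sketched roughly as follows.
This is an illustration only, not the code in this change: the
FakeRPCClient and FakeCallContext classes are hypothetical stand-ins
so the example runs without a message bus, and the two constants
simply hardcode the defaults quoted above. The keyword names
call_monitor_timeout and timeout are assumed to mirror the arguments
accepted by oslo.messaging's RPCClient.prepare().

```python
"""Sketch: a long-running RPC call with a heartbeat, using stand-ins."""

# Defaults quoted in the text; a real service would read them from config.
RPC_RESPONSE_TIMEOUT = 60   # seconds; used as the heartbeat interval
LONG_RPC_TIMEOUT = 1800     # seconds; used as the overall call timeout


class FakeCallContext:
    """Hypothetical stand-in for a prepared RPC call context."""

    def __init__(self, **prepare_kwargs):
        self.prepare_kwargs = prepare_kwargs
        self.calls = []

    def call(self, ctxt, method, **kwargs):
        # Record the call instead of sending it over a message bus.
        self.calls.append((method, kwargs))
        return None


class FakeRPCClient:
    """Hypothetical stand-in for an RPC client exposing prepare()."""

    def prepare(self, **kwargs):
        self.last_context = FakeCallContext(**kwargs)
        return self.last_context


def migrate_server(client, ctxt, instance_uuid):
    # The overall wait is bounded by the long timeout, while the shorter
    # value is how long the caller waits between heartbeats before it
    # decides that the call is no longer alive.
    cctxt = client.prepare(
        call_monitor_timeout=RPC_RESPONSE_TIMEOUT,
        timeout=LONG_RPC_TIMEOUT,
    )
    return cctxt.call(ctxt, 'migrate_server', instance_uuid=instance_uuid)


client = FakeRPCClient()
migrate_server(client, ctxt={}, instance_uuid='hypothetical-uuid')
print(client.last_context.prepare_kwargs)
```

With this split, a slow but healthy call can run up to the long
timeout, while a call that stops sending heartbeats still fails
after the shorter interval.
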
This was noticed during at least one particularly slow resize
call that timed out in the gate [1].
Related-Bug: #1763070
[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-October/010494.html
Change-Id: I9115ef6df59844cd6e702f19ba38ffbf9f8b35d3