5af632e9ca
Conductor makes an RPC call to the scheduler to get hosts during server create. In a multi-create request with a lot of servers and the default rpc_response_timeout, that call can trigger a MessagingTimeout. Due to the old retry_select_destinations decorator, conductor will retry the select_destinations RPC call up to max_attempts times, so thrice by default. This can clobber the scheduler and placement while the initial scheduler worker is still trying to process the beefy request and allocate resources in placement.

This has been recreated in a devstack test patch [1] and shown to fail with 1000 instances in a single request with the default rpc_response_timeout of 60 seconds. Changing the rpc_response_timeout to 300 avoids the MessagingTimeout and retry loop.

Since Rocky we have the long_rpc_timeout config option, which defaults to 1800 seconds. The RPC client can thus be changed to heartbeat the scheduler service during the RPC call every $rpc_response_timeout seconds with a hard timeout of $long_rpc_timeout. That change is made here.

As a result, the problematic retry_select_destinations decorator is no longer necessary and is removed here. That decorator was added in I2b891bf6d0a3d8f45fd98ca54a665ae78eab78b3 as a hack for scheduler high availability: a MessagingTimeout was assumed to mean the scheduler service had died, so retrying the request to hit another scheduler worker was reasonable. That is clearly not sufficient in the large multi-create case, and long_rpc_timeout, which heartbeats the scheduler service, is a better fit for that HA scenario.

[1] https://review.openstack.org/507918/

Change-Id: I87d89967bbc5fbf59cf44d9a63eb6e9d477ac1f3
Closes-Bug: #1795992
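The heartbeat-plus-hard-timeout behaviour described in the commit message can be illustrated with a small standard-library simulation. This is only a sketch and not nova or oslo.messaging code: monitored_call, CallTimeout and slow_select_destinations are hypothetical names invented for the example. Here monitor_timeout plays the role of rpc_response_timeout and hard_timeout plays the role of long_rpc_timeout. In oslo.messaging the corresponding knobs are, as far as we know, the call_monitor_timeout and timeout arguments of RPCClient.prepare().

```python
import queue
import threading
import time


class CallTimeout(Exception):
    """The simulated call missed a heartbeat or hit the hard deadline."""


def monitored_call(work, heartbeat_interval, monitor_timeout, hard_timeout):
    """Run work() on a simulated server that heartbeats while it is busy.

    Hypothetical helper: the caller gives up if nothing (heartbeat or reply)
    arrives within monitor_timeout, or if hard_timeout elapses overall.
    """
    messages = queue.Queue()
    done = threading.Event()

    def send_heartbeats():
        # Event.wait() returns False on timeout and True once work is done.
        while not done.wait(heartbeat_interval):
            messages.put(("heartbeat", None))

    def serve():
        try:
            messages.put(("reply", work()))
        except Exception as exc:
            messages.put(("error", exc))
        finally:
            done.set()

    threading.Thread(target=send_heartbeats, daemon=True).start()
    threading.Thread(target=serve, daemon=True).start()

    deadline = time.monotonic() + hard_timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise CallTimeout("hard timeout exceeded")
        try:
            kind, payload = messages.get(timeout=min(monitor_timeout, remaining))
        except queue.Empty:
            if time.monotonic() >= deadline:
                raise CallTimeout("hard timeout exceeded") from None
            raise CallTimeout("no heartbeat within monitor timeout") from None
        if kind == "reply":
            return payload
        if kind == "error":
            raise payload
        # kind == "heartbeat": the server is still alive, so keep waiting.


def slow_select_destinations():
    """Stand-in for a scheduler request that outlasts the monitor timeout."""
    time.sleep(0.3)
    return ["host1", "host2"]
```

Calling monitored_call(slow_select_destinations, 0.05, 0.15, 2.0) returns the host list even though the work takes longer than monitor_timeout, because heartbeats keep arriving. With a heartbeat interval longer than monitor_timeout, the same call raises CallTimeout, which is the situation that used to trigger the retry loop.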
48 lines
1.2 KiB
Python
# Copyright 2018 OpenStack Foundation
# All Rights Reserved.
#
#    Licensed under the Apache License, Version 2.0 (the "License"); you may
#    not use this file except in compliance with the License. You may obtain
#    a copy of the License at
#
#         http://www.apache.org/licenses/LICENSE-2.0
#
#    Unless required by applicable law or agreed to in writing, software
#    distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
#    WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
#    License for the specific language governing permissions and limitations
#    under the License.

from oslo_config import cfg

rpc_opts = [
    cfg.IntOpt("long_rpc_timeout",
        default=1800,
        help="""
This option allows setting an alternate timeout value for RPC calls
that have the potential to take a long time. If set, RPC calls to
other services will use this value for the timeout (in seconds)
instead of the global rpc_response_timeout value.

Operations with RPC calls that utilize this value:

* live migration
* scheduling

Related options:

* rpc_response_timeout
"""),
]


ALL_OPTS = rpc_opts


def register_opts(conf):
    conf.register_opts(ALL_OPTS)


def list_opts():
    return {'DEFAULT': ALL_OPTS}