Change consecutive build failure limit to a weigher
There is concern over the ability of compute nodes to reasonably
determine which events should count against their consecutive build
failures. Since a compute may erroneously disable itself in response to
mundane or otherwise intentional user-triggered events, this patch adds
a scheduler weigher that considers the build failure counter and can
negatively weigh hosts with recent failures. This avoids taking computes
fully out of rotation, instead treating them as less likely to be picked
for a subsequent scheduling operation.

This introduces a new conf option to control this weight. The default is
set high to maintain the existing behavior of picking nodes that are not
experiencing high failure rates, and resetting the counter as soon as a
single successful build occurs. This is a minimal visible change from
the existing behavior with default configuration.

The rationale behind the default value for this weigher comes from the
values likely to be generated by its peer weighers. The RAM and Disk
weighers will increase the score by the number of available megabytes of
memory (range in thousands) and disk (range in millions). The default
value of 1000000 for the build failure weigher will cause competing
nodes with similar amounts of available disk and a small (less than ten)
number of failures to become less desirable than those without, even
with many terabytes of available disk.

Change-Id: I71c56fe770f8c3f66db97fa542fdfdf2b9865fb8
Related-Bug: #1742102
changes/95/572195/6
parent f902e0d5d8
commit 91e29079a0
@@ -0,0 +1,33 @@
#
#    Licensed under the Apache License, Version 2.0 (the "License"); you may
#    not use this file except in compliance with the License. You may obtain
#    a copy of the License at
#
#         http://www.apache.org/licenses/LICENSE-2.0
#
#    Unless required by applicable law or agreed to in writing, software
#    distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
#    WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
#    License for the specific language governing permissions and limitations
#    under the License.
"""
BuildFailure Weigher. Weigh hosts by the number of recent failed boot attempts.

"""

import nova.conf
from nova.scheduler import weights

CONF = nova.conf.CONF


class BuildFailureWeigher(weights.BaseHostWeigher):
    def weight_multiplier(self):
        """Override the weight multiplier. Note this is negated."""
        return -1 * CONF.filter_scheduler.build_failure_weight_multiplier

    def _weigh_object(self, host_state, weight_properties):
        """Higher weights win. Our multiplier is negative, so reduce our
        weight by number of failed builds.
        """
        return host_state.failed_builds
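The weigher only reads `host_state.failed_builds`; the counting itself happens on the compute side. The commit message describes that counter as a streak of consecutive failures that a single successful build resets. A minimal sketch of that bookkeeping follows. The class and method names are hypothetical and are not taken from nova.

```python
class ConsecutiveFailureCounter:
    """Hypothetical stand-in for the compute-side failure bookkeeping."""

    def __init__(self):
        self.failed_builds = 0

    def record_build(self, succeeded):
        """Update the streak after one build attempt and return it."""
        if succeeded:
            # A single successful build clears the whole streak.
            self.failed_builds = 0
        else:
            self.failed_builds += 1
        return self.failed_builds


counter = ConsecutiveFailureCounter()
# Two failures, one success, then one more failure.
history = [counter.record_build(ok) for ok in (False, False, True, False)]
```

Because the streak resets on success, a host that fails occasionally but usually builds correctly reports a low count and is only mildly penalized by the weigher.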
@@ -0,0 +1,57 @@
#
#    Licensed under the Apache License, Version 2.0 (the "License"); you may
#    not use this file except in compliance with the License. You may obtain
#    a copy of the License at
#
#         http://www.apache.org/licenses/LICENSE-2.0
#
#    Unless required by applicable law or agreed to in writing, software
#    distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
#    WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
#    License for the specific language governing permissions and limitations
#    under the License.
"""
Tests For Scheduler build failure weights.
"""

from nova.scheduler import weights
from nova.scheduler.weights import compute
from nova import test
from nova.tests.unit.scheduler import fakes


class BuildFailureWeigherTestCase(test.NoDBTestCase):
    def setUp(self):
        super(BuildFailureWeigherTestCase, self).setUp()
        self.weight_handler = weights.HostWeightHandler()
        self.weighers = [compute.BuildFailureWeigher()]

    def _get_weighed_host(self, hosts):
        return self.weight_handler.get_weighed_objects(self.weighers,
                                                       hosts, {})

    def _get_all_hosts(self):
        host_values = [
            ('host1', 'node1', {'failed_builds': 0}),
            ('host2', 'node2', {'failed_builds': 1}),
            ('host3', 'node3', {'failed_builds': 10}),
            ('host4', 'node4', {'failed_builds': 100})
        ]
        return [fakes.FakeHostState(host, node, values)
                for host, node, values in host_values]

    def test_build_failure_weigher_disabled(self):
        self.flags(build_failure_weight_multiplier=0.0,
                   group='filter_scheduler')
        hosts = self._get_all_hosts()
        weighed_hosts = self._get_weighed_host(hosts)
        self.assertTrue(all([wh.weight == 0.0
                             for wh in weighed_hosts]))

    def test_build_failure_weigher_scaled(self):
        self.flags(build_failure_weight_multiplier=1000.0,
                   group='filter_scheduler')
        hosts = self._get_all_hosts()
        weighed_hosts = self._get_weighed_host(hosts)
        self.assertEqual([0, -10, -100, -1000],
                         [wh.weight for wh in weighed_hosts])
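The expected values in `test_build_failure_weigher_scaled` are consistent with the weight handler min-max normalizing the raw `failed_builds` values to the range [0, 1] before applying the negated multiplier. The sketch below reproduces that arithmetic outside nova. The helper names `normalize` and `weigh_failed_builds` are hypothetical, and the normalization step is an assumption inferred from the test's expected output, not code from this change.

```python
def normalize(values):
    """Min-max scale values into [0, 1]; all-equal input maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    span = hi - lo
    return [(v - lo) / span for v in values]


def weigh_failed_builds(failed_builds, multiplier):
    """Apply the negated multiplier to the normalized failure counts."""
    return [-multiplier * n for n in normalize(failed_builds)]


# Same inputs as the test hosts: 0, 1, 10 and 100 failed builds.
scaled = weigh_failed_builds([0, 1, 10, 100], 1000.0)

# A multiplier of zero makes every host weigh the same.
disabled = weigh_failed_builds([0, 1, 10, 100], 0.0)
```

Under this assumption the penalty depends on a host's failure count relative to the worst host in the candidate set, so the host with the most failures always receives the full negative multiplier.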
@@ -0,0 +1,23 @@
---
security:
  - |
    To mitigate potential issues with compute nodes disabling
    themselves in response to failures that were either non-fatal or
    user-generated, the consecutive build failure counter
    functionality in the compute service has been changed to advise
    the scheduler of the count instead of self-disabling the service
    upon exceeding the threshold. The
    ``[compute]/consecutive_build_service_disable_threshold``
    configuration option still controls whether the count is tracked,
    but the action taken on this value has been changed to a scheduler
    weigher. This allows the scheduler to be configured to weigh hosts
    with consecutive failures lower than other hosts, configured by the
    ``[filter_scheduler]/build_failure_weight_multiplier`` option. If
    the compute threshold option is nonzero, computes will report their
    failure count for the scheduler to consider. If the threshold
    value is zero, then computes will not report this value
    and the scheduler will assume the number of failures for
    non-reporting compute nodes to be zero. By default, the scheduler
    weigher is enabled and configured with a very large multiplier to
    ensure that hosts with consecutive failures are scored low by
    default.
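The two options named in the release note interact as follows: the compute-side threshold decides whether a failure count is reported at all, and the scheduler-side multiplier decides how strongly that count is penalized. The sketch below parses a sample configuration with the standard library `configparser` to illustrate this. The threshold value of 10 is purely illustrative, the multiplier of 1000000 is the default quoted in the commit message, and the variable names are hypothetical. Nova itself reads these options through its own configuration machinery, not `configparser`.

```python
import configparser

# Sample settings; the threshold value here is illustrative only.
SAMPLE_CONF = """
[compute]
consecutive_build_service_disable_threshold = 10

[filter_scheduler]
build_failure_weight_multiplier = 1000000.0
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CONF)

threshold = parser.getint(
    "compute", "consecutive_build_service_disable_threshold")
multiplier = parser.getfloat(
    "filter_scheduler", "build_failure_weight_multiplier")

# Per the release note: a nonzero threshold means computes report their
# failure count; zero means the scheduler assumes zero failures.
computes_report_failures = threshold != 0

# Per the "disabled" test case: a zero multiplier yields zero weight
# for every host, so the weigher has no effect on ranking.
weigher_has_effect = multiplier != 0.0
```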