nova/releasenotes/notes/change-consecutive-boot-failure-counter-to-weigher-428de7da0ed2033a.yaml
Dan Smith 91e29079a0 Change consecutive build failure limit to a weigher
There is concern over the ability of compute nodes to reasonably
determine which events should count against their consecutive build
failures. Since a compute may erroneously disable itself in
response to mundane or otherwise intentional user-triggered events,
this patch adds a scheduler weigher that considers the build failure
counter and can negatively weigh hosts with recent failures. This
avoids taking computes fully out of rotation, instead making them
less likely to be picked for a subsequent scheduling
operation.

This introduces a new conf option to control this weight. The default
is set high to maintain the existing behavior of picking nodes that
are not experiencing high failure rates, and resetting the counter as
soon as a single successful build occurs. With the default
configuration, this is a minimal visible change from the existing
behavior.
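
The counter behavior described above (count consecutive failures, reset on the first success) can be sketched as follows. This is a minimal illustration only: the class and method names are hypothetical and are not taken from the nova source.

```python
# Hypothetical sketch of a consecutive build failure counter.
# Names here are illustrative assumptions, not nova's actual code.
class BuildFailureCounter:
    def __init__(self):
        self.failed_builds = 0

    def record(self, succeeded):
        """Update the counter for one build attempt and return its value."""
        if succeeded:
            # A single successful build resets the counter.
            self.failed_builds = 0
        else:
            self.failed_builds += 1
        return self.failed_builds


counter = BuildFailureCounter()
# Three failures, one success, then another failure.
history = [counter.record(ok) for ok in (False, False, False, True, False)]
```

Under this sketch, the value the scheduler would see drops back to zero as soon as one build succeeds, so a host is only penalized while its failures are consecutive.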

The rationale behind the default value for this weigher comes from the
values likely to be generated by its peer weighers. The RAM and Disk
weighers will increase the score by the number of available megabytes of
memory (range in thousands) and disk (range in millions). The default
value of 1000000 for the build failure weigher will cause competing
nodes with similar amounts of available disk and a small (less than ten)
number of failures to become less desirable than those without, even
with many terabytes of available disk.
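
The arithmetic behind that rationale can be illustrated with a small sketch. It takes the description above at face value and assumes a simple additive score (free RAM in MB plus free disk in MB, minus the multiplier times the failure count); the function name and host figures are hypothetical, and the real scheduler may combine or normalize weights differently.

```python
# Hypothetical additive model of the rationale above; not nova's scheduler code.
BUILD_FAILURE_WEIGHT_MULTIPLIER = 1000000  # default value named in this commit


def score(free_ram_mb, free_disk_mb, failed_builds,
          multiplier=BUILD_FAILURE_WEIGHT_MULTIPLIER):
    """Return an illustrative host score: higher means more desirable."""
    return free_ram_mb + free_disk_mb - multiplier * failed_builds


# Two made-up hosts with similar free disk (about 4 TB, expressed in MB).
hosts = {
    "healthy": score(free_ram_mb=8192, free_disk_mb=4000000, failed_builds=0),
    "flaky": score(free_ram_mb=8192, free_disk_mb=4200000, failed_builds=3),
}
best = max(hosts, key=hosts.get)
```

In this model the host with three recent failures loses 3000000 points, which outweighs its slightly larger amount of free disk, so the host without failures is preferred.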

Change-Id: I71c56fe770f8c3f66db97fa542fdfdf2b9865fb8
Related-Bug: #1742102
2018-06-06 15:18:50 -07:00


---
security:
  - |
    To mitigate potential issues with compute nodes disabling
    themselves in response to failures that were either non-fatal or
    user-generated, the consecutive build failure counter
    functionality in the compute service has been changed to advise
    the scheduler of the count instead of self-disabling the service
    upon exceeding the threshold. The
    ``[compute]/consecutive_build_service_disable_threshold``
    configuration option still controls whether the count is tracked,
    but the action taken on this value has been changed to a scheduler
    weigher. This allows the scheduler to be configured to weigh hosts
    with consecutive failures lower than other hosts, configured by the
    ``[filter_scheduler]/build_failure_weight_multiplier`` option. If
    the compute threshold option is nonzero, computes will report their
    failure count for the scheduler to consider. If the threshold
    value is zero, then computes will not report this value
    and the scheduler will assume the number of failures for
    non-reporting compute nodes to be zero. By default, the scheduler
    weigher is enabled and configured with a very large multiplier to
    ensure that hosts with consecutive failures are scored low by
    default.