91e29079a0
There is concern over the ability of compute nodes to reasonably determine which events should count toward their consecutive build failures. Since a compute may erroneously disable itself in response to mundane or otherwise intentional user-triggered events, this patch adds a scheduler weigher that considers the build failure counter and can negatively weigh hosts with recent failures. This avoids taking computes fully out of rotation, and instead treats them as less likely to be picked for a subsequent scheduling operation.

This introduces a new conf option to control this weight. The default is set high to maintain the existing behavior of picking nodes that are not experiencing high failure rates, and resetting the counter as soon as a single successful build occurs. This is a minimal visible change from the existing behavior with default configuration.

The rationale behind the default value for this weigher comes from the values likely to be generated by its peer weighers. The RAM and Disk weighers will increase the score by the number of available megabytes of memory (range in thousands) and disk (range in millions). The default value of 1000000 for the build failure weigher will cause competing nodes with similar amounts of available disk and a small (fewer than ten) number of failures to become less desirable than those without, even with many terabytes of available disk.

Change-Id: I71c56fe770f8c3f66db97fa542fdfdf2b9865fb8
Related-Bug: #1742102
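The arithmetic behind the default multiplier can be illustrated with a minimal Python sketch. It follows the commit message's description literally: the RAM and disk terms add available megabytes to the score, and the build failure term subtracts the multiplier times the failure count. The `Host` class and the `score` and `rank` functions are hypothetical names invented for this illustration and are not Nova's weigher code; the real scheduler may normalize or combine weights differently.

```python
# Illustrative only: NOT Nova's actual weigher implementation.
from dataclasses import dataclass

# Default value named in the commit message.
BUILD_FAILURE_WEIGHT_MULTIPLIER = 1_000_000


@dataclass
class Host:
    """Hypothetical stand-in for the scheduler's view of a compute host."""
    name: str
    free_ram_mb: int
    free_disk_mb: int
    failed_builds: int = 0  # non-reporting computes are assumed to have zero


def score(host, multiplier=BUILD_FAILURE_WEIGHT_MULTIPLIER):
    """Add the RAM and disk terms, then subtract the build failure penalty."""
    return host.free_ram_mb + host.free_disk_mb - multiplier * host.failed_builds


def rank(hosts, multiplier=BUILD_FAILURE_WEIGHT_MULTIPLIER):
    """Return hosts ordered from most to least desirable."""
    return sorted(hosts, key=lambda h: score(h, multiplier), reverse=True)


hosts = [
    Host("flaky", free_ram_mb=8_192, free_disk_mb=2_000_000, failed_builds=3),
    Host("healthy", free_ram_mb=8_192, free_disk_mb=2_000_000),
]
ranked = [h.name for h in rank(hosts)]
```

Under these assumptions the host with three recent failures stays in the candidate list but sorts below an otherwise identical host with none, which is the "less likely to be picked" behavior the commit message describes.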
24 lines
1.2 KiB
YAML
---
security:
  - |
    To mitigate potential issues with compute nodes disabling
    themselves in response to failures that were either non-fatal or
    user-generated, the consecutive build failure counter
    functionality in the compute service has been changed to advise
    the scheduler of the count instead of self-disabling the service
    upon exceeding the threshold. The
    ``[compute]/consecutive_build_service_disable_threshold``
    configuration option still controls whether the count is tracked,
    but the action taken on this value has been changed to a scheduler
    weigher. This allows the scheduler to be configured to weigh hosts
    with consecutive failures lower than other hosts, configured by the
    ``[filter_scheduler]/build_failure_weight_multiplier`` option. If
    the compute threshold option is nonzero, computes will report their
    failure count for the scheduler to consider. If the threshold
    value is zero, then computes will not report this value
    and the scheduler will assume the number of failures for
    non-reporting compute nodes to be zero. By default, the scheduler
    weigher is enabled and configured with a very large multiplier to
    ensure that hosts with consecutive failures are scored low.