2ab8364649
Currently, if a heartbeat fails, we reschedule it after 5 seconds. This is fine for the first retry, but it can cause a thundering herd problem when many nodes fail to heartbeat at once.

This change adds jitter to the minimum wait of 5 seconds. The jitter is not applied to forced heartbeats: they still have a minimum wait of exactly 5 seconds from the last heartbeat.

The code is re-ordered to move the interval calculation to one place. Bonus: the next interval is now logged correctly.

The unit tests have been rewritten to test the heartbeat process step by step rather than rely on the exact sequence of calls.

Closes-Bug: #2038438
Change-Id: I4c4207b15fb3d48b55e340b7b3b54af833f92cb5
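
The retry rule described above can be illustrated with a minimal sketch. It is not the code from this change: the function name `retry_interval`, the constant names, and the 50% jitter ceiling are all assumptions made for illustration, since the commit message does not state the jitter range.

```python
import random

MIN_RETRY_WAIT = 5.0  # seconds; the minimum wait named in the commit message
MAX_JITTER = 0.5      # assumed ceiling (up to 50% extra); the real range is not stated


def retry_interval(forced, rng=random):
    """Return the number of seconds to wait before retrying a failed heartbeat.

    Hypothetical helper: `forced` marks a forced heartbeat, and `rng` is any
    object with a `uniform(a, b)` method (the `random` module by default).
    """
    if forced:
        # Forced heartbeats keep the exact minimum wait, with no jitter.
        return MIN_RETRY_WAIT
    # Spread retries across [MIN_RETRY_WAIT, MIN_RETRY_WAIT * (1 + MAX_JITTER)]
    # so that many nodes failing at once do not all retry at the same moment.
    return MIN_RETRY_WAIT * (1.0 + rng.uniform(0.0, MAX_JITTER))
```

Passing a seeded `random.Random` instance as `rng` makes the jittered value reproducible, which suits step-by-step unit tests of the kind the commit message mentions.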
8 lines
258 B
YAML
---
fixes:
  - |
    Adds random jitter to retried heartbeats after Ironic returns an error.
    Previously, heartbeats would be retried after 5 seconds, potentially
    causing a thundering herd problem if many nodes fail to heartbeat at
    the same time.