Merge "Maintenance heartbeat failure handling over active controller service activation"

2021-12-10 23:43:33 +00:00
parent 14597345a6 d5d273996d
commit baff361d45
3 changed files with 53 additions and 0 deletions
--- a/doc/source/node_management/kubernetes/customizing_the_host_life_cycles/handling-maintenance-heartbeat-failure-for-active-controller-service-activation-70fb51663717.rst
+++ b/doc/source/node_management/kubernetes/customizing_the_host_life_cycles/handling-maintenance-heartbeat-failure-for-active-controller-service-activation-70fb51663717.rst
@@ -0,0 +1,50 @@
+.. _handling-maintenance-heartbeat-failure-for-active-controller-service-activation-70fb51663717:
+
+=============================================================================
+Handle Maintenance Heartbeat Failure for Active Controller Service Activation
+=============================================================================
+
+Service Management starts Maintenance, along with the other active controller
+services, in response to one of the following three events:
+
+-   on initial controller activity startup, or
+
+-   on a controlled or uncontrolled controller |SWACT|, or
+
+-   on active controller selection following a reboot or power outage of both
+    controllers, that is, |DOR|
+
+In each case, Maintenance queries System Inventory at startup for a list of
+provisioned hosts along with their configuration and state information.
+
+Hosts that are found to be in the unlocked/enabled state are expected to
+service Maintenance heartbeat.
+
+However, the uptime of the active controller affects how quickly Maintenance
+reacts to unlocked/enabled hosts that fail heartbeat following controller
+services activation.
+
+If the active controller reboots or loses power, then the standby controller
+takes over by way of an uncontrolled |SWACT|.
+
+**Greater than 15 minute uptime**: When Maintenance starts on a controller whose
+uptime is greater than 15 minutes, any unlocked/enabled host that is not
+servicing heartbeat is given a 5 second grace period before Maintenance
+declares the node failed and puts it into **Graceful Recovery**.
+
+**Graceful Recovery** is a Maintenance heartbeat failure state that avoids a
+second reboot if, when heartbeat recovers, the host is found to have already
+rebooted.
+
+If both controllers reboot or lose power, then Service Management will start
+services on the first healthy controller following the outage.
+
+**Less than 15 minute uptime**: When Maintenance starts on a controller whose
+uptime is less than 15 minutes, it assumes the system is in |DOR| mode.
+In |DOR| mode, Maintenance is more tolerant of unlocked/enabled hosts that are
+not servicing heartbeat immediately after the Maintenance process starts.
+Instead of failing a node after 5 seconds, it waits up to 10 minutes, giving
+servers a longer grace period to recover, because power outage recovery time
+can vary from server to server.
+
+
--- a/doc/source/node_management/kubernetes/index.rst
+++ b/doc/source/node_management/kubernetes/index.rst
@@ -277,6 +277,7 @@ Customize host life cycles
   customizing_the_host_life_cycles/adjusting-the-host-heartbeat-interval-and-heartbeat-response-thresholds
   customizing_the_host_life_cycles/configuring-heartbeat-failure-action
   customizing_the_host_life_cycles/configuring-multi-node-failure-avoidance
+   customizing_the_host_life_cycles/handling-maintenance-heartbeat-failure-for-active-controller-service-activation-70fb51663717

 --------------------
 Node inventory tasks
--- a/doc/source/shared/abbrevs.txt
+++ b/doc/source/shared/abbrevs.txt
@@ -34,6 +34,7 @@
 .. |CVE| replace:: :abbr:`CVE (Common Vulnerabilities and Exposures)`
 .. |DAD| replace:: :abbr:`DAD (Duplicate Address Detection)`
 .. |DC| replace:: :abbr:`DC (Distributed Cloud)`
+.. |DOR| replace:: :abbr:`DOR (Dead Office Recovery)`
 .. |DHCP| replace:: :abbr:`DHCP (Dynamic Host Configuration Protocol)`
 .. |DMA| replace:: :abbr:`DMA (Direct Memory Access)`
 .. |DNS| replace:: :abbr:`DNS (Domain Name System)`
@@ -123,6 +124,7 @@
 .. |SSH| replace:: :abbr:`SSH (Secure Shell)`
 .. |SSL| replace:: :abbr:`SSL (Secure Socket Layer)`
 .. |STP| replace:: :abbr:`STP (Spanning Tree Protocol)`
+.. |SWACT| replace:: :abbr:`SWACT (SWitch ACTivity)`
 .. |TCP| replace:: :abbr:`TCP (Transition Control Protocol)`
 .. |TFTP| replace:: :abbr:`TFTP (Trivial File Transfer Protocol)`
 .. |TLS| replace:: :abbr:`TLS (Transport Layer Security)`