Merge "Maintenance heartbeat failure handling over active controller service activation"
commit baff361d45
@@ -0,0 +1,50 @@
.. _handling-maintenance-heartbeat-failure-for-active-controller-service-activation-70fb51663717:

=============================================================================
Handle Maintenance Heartbeat Failure for Active Controller Service Activation
=============================================================================

Maintenance is started by Service Management (along with other active
controller services) based on one of the following three events:

- on initial controller activity startup, or

- on a controlled or uncontrolled controller |SWACT|, or

- on active controller selection following a double controller reboot or
  power outage; that is, |DOR|

In such events, the Maintenance process queries System Inventory at startup
for a list of provisioned hosts along with their configuration and state
information.

Hosts that are found to be in the unlocked/enabled state are expected to
service the Maintenance heartbeat.
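
The host selection described above can be illustrated with a minimal sketch.
It is not the Maintenance implementation: the ``Host`` class, its field names,
and the sample inventory are hypothetical and exist only to show the
unlocked/enabled filter.

```python
from dataclasses import dataclass


@dataclass
class Host:
    # Hypothetical record for one provisioned host; field names are assumptions.
    hostname: str
    administrative: str  # "locked" or "unlocked"
    operational: str     # "enabled" or "disabled"


def hosts_expected_to_heartbeat(hosts):
    """Return only the hosts that are in the unlocked/enabled state."""
    return [
        h for h in hosts
        if h.administrative == "unlocked" and h.operational == "enabled"
    ]


# Illustrative inventory, standing in for the list queried at startup.
inventory = [
    Host("controller-1", "unlocked", "enabled"),
    Host("worker-0", "unlocked", "enabled"),
    Host("worker-1", "locked", "disabled"),
    Host("worker-2", "unlocked", "disabled"),
]

expected = hosts_expected_to_heartbeat(inventory)
```

In this sketch only ``controller-1`` and ``worker-0`` are monitored for
heartbeat; locked or disabled hosts are skipped.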

However, the uptime of the active controller affects how quickly Maintenance
reacts to unlocked/enabled hosts that fail heartbeat following controller
services activation.

If the active controller reboots or loses power, then the standby controller
takes over by way of an uncontrolled |SWACT|.

**Greater than 15 minute uptime**: When Maintenance starts on a controller whose
uptime is greater than 15 minutes, any host found to be in the unlocked/enabled
state and not servicing heartbeat is given a 5-second grace period before
Maintenance declares the node failed and puts it into **Graceful Recovery**.

**Graceful Recovery** is a Maintenance heartbeat failure state that can avoid
a second reboot if the host is found to have already rebooted when it recovers
from the heartbeat loss.
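
One way to picture the "already rebooted" check is sketched below. This is an
assumption, not the documented mechanism: it supposes that the recovering host
reports its own uptime and that Maintenance compares it with the time elapsed
since heartbeat was lost. The function and parameter names are hypothetical.

```python
def needs_reboot(host_uptime_secs: float, secs_since_heartbeat_loss: float) -> bool:
    """Decide whether a recovered host still needs a reboot (illustrative only).

    Assumption: a host whose uptime is shorter than the time since heartbeat
    was lost must have rebooted during the outage, so a second reboot is
    unnecessary.
    """
    already_rebooted = host_uptime_secs < secs_since_heartbeat_loss
    return not already_rebooted
```

For example, a host that reports 120 seconds of uptime after a 300-second
heartbeat loss would be treated as already rebooted under this assumption.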

If both controllers reboot or lose power, then Service Management starts
services on the first healthy controller following the outage.

**Less than 15 minute uptime**: When Maintenance starts on a controller whose
uptime is less than 15 minutes, it assumes the system is in |DOR| mode.
In |DOR| mode, Maintenance is more tolerant of unlocked/enabled hosts that are
not immediately servicing heartbeat following Maintenance process startup.
Instead of failing a node after 5 seconds, it waits up to 10 minutes to give
servers a longer grace period to recover, because power outage recovery
time can vary from server to server.
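
The two uptime cases can be summarized as a small decision function. This is a
sketch of the rule as described here, not the Maintenance source code; the
constant and function names are hypothetical, and the handling of an uptime of
exactly 15 minutes is an assumption because the text does not specify it.

```python
# Values taken from the description above; the names are illustrative.
DOR_UPTIME_THRESHOLD_SECS = 15 * 60  # controller uptime below this implies DOR mode
NORMAL_GRACE_SECS = 5                # grace period outside DOR mode
DOR_GRACE_SECS = 10 * 60             # maximum wait in DOR mode


def heartbeat_grace_period(controller_uptime_secs: float) -> int:
    """Return the grace period, in seconds, for hosts not servicing heartbeat."""
    if controller_uptime_secs < DOR_UPTIME_THRESHOLD_SECS:
        # Recent controller boot: assume a double controller outage (DOR).
        return DOR_GRACE_SECS
    # Assumption: an uptime of exactly 15 minutes is treated as the normal case.
    return NORMAL_GRACE_SECS
```

With these values, a controller that has been up for one minute waits up to
600 seconds for a silent host, while a controller that has been up for an hour
waits 5 seconds.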
@@ -277,6 +277,7 @@ Customize host life cycles
customizing_the_host_life_cycles/adjusting-the-host-heartbeat-interval-and-heartbeat-response-thresholds
customizing_the_host_life_cycles/configuring-heartbeat-failure-action
customizing_the_host_life_cycles/configuring-multi-node-failure-avoidance
customizing_the_host_life_cycles/handling-maintenance-heartbeat-failure-for-active-controller-service-activation-70fb51663717

--------------------
Node inventory tasks
@@ -34,6 +34,7 @@
.. |CVE| replace:: :abbr:`CVE (Common Vulnerabilities and Exposures)`
.. |DAD| replace:: :abbr:`DAD (Duplicate Address Detection)`
.. |DC| replace:: :abbr:`DC (Distributed Cloud)`
.. |DOR| replace:: :abbr:`DOR (Dead Office Recovery)`
.. |DHCP| replace:: :abbr:`DHCP (Dynamic Host Configuration Protocol)`
.. |DMA| replace:: :abbr:`DMA (Direct Memory Access)`
.. |DNS| replace:: :abbr:`DNS (Domain Name System)`
@@ -123,6 +124,7 @@
.. |SSH| replace:: :abbr:`SSH (Secure Shell)`
.. |SSL| replace:: :abbr:`SSL (Secure Sockets Layer)`
.. |STP| replace:: :abbr:`STP (Spanning Tree Protocol)`
.. |SWACT| replace:: :abbr:`SWACT (SWitch ACTivity)`
.. |TCP| replace:: :abbr:`TCP (Transmission Control Protocol)`
.. |TFTP| replace:: :abbr:`TFTP (Trivial File Transfer Protocol)`
.. |TLS| replace:: :abbr:`TLS (Transport Layer Security)`