From b71bcbf982e3b5585e4abe8770977bf6e999624b Mon Sep 17 00:00:00 2001
From: Michele Baldessari
Date: Mon, 12 Apr 2021 14:22:58 +0200
Subject: [PATCH] Move stonith resource creation to step2

With the merging of the pcs on host patchset for train we are seeing a
problem with FFUs on Instance HA environments.

Preamble: TripleO keeps the stonith-enabled cluster property set to
false until puppet step 5.

With the pcs on host patchset the enablement still happens at step 5,
but it gets triggered during the tripleo_ha_wrapper deployment task of
cinder-volume, which tries to restart the cinder-volume service (during
the leapp of the first controller). This hangs forever because
pacemaker is in the following transition:
- stonith-fence_compute-fence-nova is configured
- pacemaker wants to call stonith "on" for controller-0 (which is
  probably dumb, but it is unlikely we'll be able to change that in the
  right timeframe, as it seems a potentially involved change in
  behaviour)
- Any other action, like the cinder-volume restart in this case, is
  stuck and the FFU fails.

If we simply move the stonith resource creation to step 2 (and change
nothing else about the stonith-enabled property being set at step 5),
we fix this.

Tested by injecting this puppet-tripleo review into the FFU
queens->train upgrade on an IHA system: the FFU now passes.
Also applied this patch to a Train-based IHA deployment and verified
that deployment, redeploy, minor update and scaleup all keep working.

Closes-Bug: #1923723
Change-Id: Ib3e2d9c93221dfc2e15974142f30e8c84e7afd63
---
 manifests/profile/base/pacemaker.pp | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/manifests/profile/base/pacemaker.pp b/manifests/profile/base/pacemaker.pp
index b07765216..187af8173 100644
--- a/manifests/profile/base/pacemaker.pp
+++ b/manifests/profile/base/pacemaker.pp
@@ -146,7 +146,14 @@ class tripleo::profile::base::pacemaker (
     $pacemaker_master = false
   }
 
+  # enable_fencing guides the enablement of the stonith-enabled cluster-wide property
+  # enable_stonith_resources drives the creation of the stonith resources themselves and happens at
+  # step2. The reason for step2 is the following:
+  # During step1 the cluster is created (and also the pcmk remote resources in case of IHA)
+  # Since stonith resources are created on each node separately we need to have the guarantee that
+  # all cluster nodes + remote exist before creating stonith resources for them
   $enable_fencing = str2bool(hiera('enable_fencing', false)) and $step >= 5
+  $enable_stonith_resources = str2bool(hiera('enable_fencing', false)) and $step >= 2
 
   if $step >= 1 {
     if (hiera('pacemaker_short_node_names_override', undef)) {
@@ -212,7 +219,7 @@ class tripleo::profile::base::pacemaker (
     }
     Class['pacemaker::stonith'] -> Exec<|tag == 'pacemaker-scaleup'|>
   }
-  if $enable_fencing {
+  if $enable_stonith_resources {
     include tripleo::fencing
 
     # enable stonith after all Pacemaker resources have been created