From b7af3b650923df80e067e5545108d9cfe4c89f83 Mon Sep 17 00:00:00 2001 From: Michele Baldessari Date: Tue, 21 Jul 2020 10:07:23 +0200 Subject: [PATCH] Retry calling backup_cib() Currently if backup_cib() fails we just give up instantly. In most cases this is actually okay since there are no chances that the command succeeds in later retries. There are a number of cases though were retrying should be done. IHA FFU is one of them because the following might happen: 1) On the compute node we will try to set up a property when the remote is not yet connected Jul 21 04:59:45 compute-1 puppet-user[61009]: Debug: Executing: '/sbin/ip6tables-save' Jul 21 04:59:46 compute-1 pacemaker-remoted[42459]: warning: Cannot proxy IPC connection from uid 0 gid 0 to cib_rw because not connected to cluster Jul 21 04:59:46 compute-1 pacemaker-remoted[42459]: error: Error in connection setup (/dev/shm/qb-42459-61225-14-DzDuCR/qb): Remote I/O error (121) Jul 21 04:59:46 compute-1 puppet-user[61009]: Error: /Stage[main]/Tripleo::Profile::Pacemaker::Compute_instanceha/Pacemaker::Property[compute-instanceha-role-node-property]/Pcmk_property[property-compute-1-compute-instanceha-role]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20200721-61008-hz2sdw failed with code: 1 -> Error: unable to get cib 2) This happens because in IHA FFU after the control plane has been upgraded we upgrade the compute nodes one by one, without running any other commands on the control plane. So we need to keep retrying setting the property even if backup_cib() fails because eventually the core cluster *will* reconnect to the remote: Jul 21 05:01:22 compute-1 pacemaker-remoted[42459]: notice: Remote client connection accepted That is due to the fact that we cannot necessarily control at what point the core cluster retries the connection, so we should strive for retrying a few times no matter what. Tested this 4 times in a row successfully (before this patch we would fail quite often) Related-Bug: #1888398 Change-Id: I1b5c2ed35f83b6572db9e919a4af4cd46f0e98e8 --- lib/puppet/provider/pcmk_common.rb | 47 ++++++++++++++++++------------ 1 file changed, 28 insertions(+), 19 deletions(-) diff --git a/lib/puppet/provider/pcmk_common.rb b/lib/puppet/provider/pcmk_common.rb index d6a5db79..8848c291 100644 --- a/lib/puppet/provider/pcmk_common.rb +++ b/lib/puppet/provider/pcmk_common.rb @@ -113,29 +113,38 @@ def pcs(name, resource_name, cmd, tries=1, try_sleep=0, if name.start_with?("create") && verify_on_create return pcs_create_with_verify(name, resource_name, cmd, tries, try_sleep) end - max_tries = name.include?('show') ? 1 : tries + max_tries = tries max_tries.times do |try| - cib = backup_cib() - try_text = max_tries > 1 ? "try #{try+1}/#{max_tries}: " : '' - Puppet.debug("#{try_text}#{PCS_BIN} -f #{cib} #{cmd}") - pcs_out = `#{PCS_BIN} -f #{cib} #{cmd} 2>&1` - if name.include?('show') - delete_cib(cib) - # return output for good exit or false for failure. - return $?.exitstatus == 0 ? pcs_out : false - end - if $?.exitstatus == 0 - # If push_cib failed, we stay in the loop and keep trying - if push_cib(cib) == 0 - sleep post_success_sleep - return pcs_out + begin + try_text = max_tries > 1 ? "try #{try+1}/#{max_tries}: " : '' + cib = backup_cib() + Puppet.debug("#{try_text}#{PCS_BIN} -f #{cib} #{cmd}") + pcs_out = `#{PCS_BIN} -f #{cib} #{cmd} 2>&1` + if name.include?('show') + delete_cib(cib) + # return output for good exit or false for failure. + return $?.exitstatus == 0 ? pcs_out : false end + if $?.exitstatus == 0 + # If push_cib failed, we stay in the loop and keep trying + if push_cib(cib) == 0 + sleep post_success_sleep + return pcs_out + end + end + Puppet.debug("Error: #{pcs_out}") + rescue Puppet::Error + Puppet.debug("cib_backup failed. Retrying #{try_text}") end - Puppet.debug("Error: #{pcs_out}") if try == max_tries-1 - delete_cib(cib) - pcs_out_line = pcs_out.lines.first ? pcs_out.lines.first.chomp! : '' - raise Puppet::Error, "pcs -f #{cib} #{name} failed: #{pcs_out_line}" + # need to consider the case that pcs_out was always nil due to cib_backup() always failing + delete_cib(cib) if cib + if pcs_out == nil + pcs_out_line = '' + else + pcs_out_line = pcs_out.lines.first ? pcs_out.lines.first.chomp! : '' + end + raise Puppet::Error, "pcs -f #{cib} #{cmd} failed: #{pcs_out_line}. Too many tries" end if try_sleep > 0 Puppet.debug("Sleeping for #{try_sleep} seconds between tries")