Retry calling backup_cib()

Currently, if backup_cib() fails we give up immediately.
In most cases this is fine, since the command is unlikely to succeed
on a later retry. There are a number of cases, though, where retrying
is the right thing to do. IHA FFU is one of them, because the
following might happen:
1) On the compute node we try to set a property while the remote
   is not yet connected:
Jul 21 04:59:45 compute-1 puppet-user[61009]: Debug: Executing: '/sbin/ip6tables-save'
Jul 21 04:59:46 compute-1 pacemaker-remoted[42459]: warning: Cannot proxy IPC connection from uid 0 gid 0 to cib_rw because not connected to cluster
Jul 21 04:59:46 compute-1 pacemaker-remoted[42459]: error: Error in connection setup (/dev/shm/qb-42459-61225-14-DzDuCR/qb): Remote I/O error (121)
Jul 21 04:59:46 compute-1 puppet-user[61009]: Error: /Stage[main]/Tripleo::Profile::Pacemaker::Compute_instanceha/Pacemaker::Property[compute-instanceha-role-node-property]/Pcmk_property[property-compute-1-compute-instanceha-role]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20200721-61008-hz2sdw failed with code: 1 -> Error: unable to get cib

2) This happens because in IHA FFU after the control plane has been
   upgraded we upgrade the compute nodes one by one, without running
   any other commands on the control plane. So we need to keep retrying
   setting the property even if backup_cib() fails because eventually
   the core cluster *will* reconnect to the remote:
Jul 21 05:01:22 compute-1 pacemaker-remoted[42459]: notice: Remote client connection accepted

We cannot necessarily control at what point the core cluster retries
the connection, so we should retry a few times no matter what.
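
The retry behaviour described above can be sketched in isolation as
follows. This is only an illustration, not the code from this change:
BackupError, with_retries and fake_backup_cib are hypothetical names
standing in for Puppet::Error, the loop in pcs() and backup_cib(), and
the "remote reconnects on the third attempt" behaviour is simulated.

```ruby
# Hypothetical stand-in for Puppet::Error so the sketch runs without Puppet.
class BackupError < StandardError; end

# Run the block up to `tries` times, sleeping `try_sleep` seconds between
# failed attempts. Returns the block's value on the first success and
# raises once every attempt has failed.
def with_retries(tries: 3, try_sleep: 0)
  last_error = nil
  tries.times do |try|
    begin
      return yield(try + 1)
    rescue BackupError => e
      last_error = e
      sleep try_sleep if try_sleep > 0 && try < tries - 1
    end
  end
  raise BackupError, "failed after #{tries} tries: #{last_error&.message}"
end

# Simulated backup: fails until the remote "reconnects" on attempt 3.
def fake_backup_cib(attempt)
  raise BackupError, 'unable to get cib' if attempt < 3
  "/tmp/cib-backup-#{attempt}"
end

result = with_retries(tries: 5) { |attempt| fake_backup_cib(attempt) }
puts result
```

With fewer tries than the simulated reconnect needs (for example
tries: 2), the helper raises instead of returning, which mirrors the
"Too many tries" error path.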

Tested this 4 times in a row successfully (before this patch we would
fail quite often)

Related-Bug: #1888398

Change-Id: I1b5c2ed35f83b6572db9e919a4af4cd46f0e98e8
Michele Baldessari 2020-07-21 10:07:23 +02:00
parent 1bb2fe70b1
commit b7af3b6509
1 changed files with 28 additions and 19 deletions

@@ -113,29 +113,38 @@ def pcs(name, resource_name, cmd, tries=1, try_sleep=0,
   if name.start_with?("create") && verify_on_create
     return pcs_create_with_verify(name, resource_name, cmd, tries, try_sleep)
   end
-  max_tries = name.include?('show') ? 1 : tries
+  max_tries = tries
   max_tries.times do |try|
-    cib = backup_cib()
-    try_text = max_tries > 1 ? "try #{try+1}/#{max_tries}: " : ''
-    Puppet.debug("#{try_text}#{PCS_BIN} -f #{cib} #{cmd}")
-    pcs_out = `#{PCS_BIN} -f #{cib} #{cmd} 2>&1`
-    if name.include?('show')
-      delete_cib(cib)
-      # return output for good exit or false for failure.
-      return $?.exitstatus == 0 ? pcs_out : false
-    end
-    if $?.exitstatus == 0
-      # If push_cib failed, we stay in the loop and keep trying
-      if push_cib(cib) == 0
-        sleep post_success_sleep
-        return pcs_out
+    begin
+      try_text = max_tries > 1 ? "try #{try+1}/#{max_tries}: " : ''
+      cib = backup_cib()
+      Puppet.debug("#{try_text}#{PCS_BIN} -f #{cib} #{cmd}")
+      pcs_out = `#{PCS_BIN} -f #{cib} #{cmd} 2>&1`
+      if name.include?('show')
+        delete_cib(cib)
+        # return output for good exit or false for failure.
+        return $?.exitstatus == 0 ? pcs_out : false
       end
+      if $?.exitstatus == 0
+        # If push_cib failed, we stay in the loop and keep trying
+        if push_cib(cib) == 0
+          sleep post_success_sleep
+          return pcs_out
+        end
+      end
+      Puppet.debug("Error: #{pcs_out}")
+    rescue Puppet::Error
+      Puppet.debug("cib_backup failed. Retrying #{try_text}")
     end
-    Puppet.debug("Error: #{pcs_out}")
     if try == max_tries-1
-      delete_cib(cib)
-      pcs_out_line = pcs_out.lines.first ? pcs_out.lines.first.chomp! : ''
-      raise Puppet::Error, "pcs -f #{cib} #{name} failed: #{pcs_out_line}"
+      # need to consider the case that pcs_out was always nil due to cib_backup() always failing
+      delete_cib(cib) if cib
+      if pcs_out == nil
+        pcs_out_line = ''
+      else
+        pcs_out_line = pcs_out.lines.first ? pcs_out.lines.first.chomp! : ''
+      end
+      raise Puppet::Error, "pcs -f #{cib} #{cmd} failed: #{pcs_out_line}. Too many tries"
     end
     if try_sleep > 0
       Puppet.debug("Sleeping for #{try_sleep} seconds between tries")