Browse Source

Retry calling backup_cib()

Currently if backup_cib() fails we just give up instantly.
In most cases this is actually okay since there are no chances
that the command succeeds in later retries. There are a number of cases
though were retrying should be done. IHA FFU is one of them because the
following might happen:
1) On the compute node we will try to set up a property when the remote
   is not yet connected
Jul 21 04:59:45 compute-1 puppet-user[61009]: Debug: Executing: '/sbin/ip6tables-save'
Jul 21 04:59:46 compute-1 pacemaker-remoted[42459]: warning: Cannot proxy IPC connection from uid 0 gid 0 to cib_rw because not connected to cluster
Jul 21 04:59:46 compute-1 pacemaker-remoted[42459]: error: Error in connection setup (/dev/shm/qb-42459-61225-14-DzDuCR/qb): Remote I/O error (121)
Jul 21 04:59:46 compute-1 puppet-user[61009]: Error: /Stage[main]/Tripleo::Profile::Pacemaker::Compute_instanceha/Pacemaker::Property[compute-instanceha-role-node-property]/Pcmk_property[property-compute-1-compute-instanceha-role]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20200721-61008-hz2sdw failed with code: 1 -> Error: unable to get cib

2) This happens because in IHA FFU after the control plane has been
   upgraded we upgrade the compute nodes one by one, without running
   any other commands on the control plane. So we need to keep retrying
   setting the property even if backup_cib() fails because eventually
   the core cluster *will* reconnect to the remote:
Jul 21 05:01:22 compute-1 pacemaker-remoted[42459]: notice: Remote client connection accepted

That is due to the fact that we cannot necessarily control at what point
the core cluster retries the connection, so we should strive for
retrying a few times no matter what.

Tested this 4 times in a row successfully (before this patch we would
fail quite often)

Related-Bug: #1888398

Change-Id: I1b5c2ed35f83b6572db9e919a4af4cd46f0e98e8
changes/93/742093/4
Michele Baldessari 3 weeks ago
parent
commit
b7af3b6509
1 changed files with 28 additions and 19 deletions
  1. +28
    -19
      lib/puppet/provider/pcmk_common.rb

+ 28
- 19
lib/puppet/provider/pcmk_common.rb View File

@@ -113,29 +113,38 @@ def pcs(name, resource_name, cmd, tries=1, try_sleep=0,
if name.start_with?("create") && verify_on_create
return pcs_create_with_verify(name, resource_name, cmd, tries, try_sleep)
end
max_tries = name.include?('show') ? 1 : tries
max_tries = tries
max_tries.times do |try|
cib = backup_cib()
try_text = max_tries > 1 ? "try #{try+1}/#{max_tries}: " : ''
Puppet.debug("#{try_text}#{PCS_BIN} -f #{cib} #{cmd}")
pcs_out = `#{PCS_BIN} -f #{cib} #{cmd} 2>&1`
if name.include?('show')
delete_cib(cib)
# return output for good exit or false for failure.
return $?.exitstatus == 0 ? pcs_out : false
end
if $?.exitstatus == 0
# If push_cib failed, we stay in the loop and keep trying
if push_cib(cib) == 0
sleep post_success_sleep
return pcs_out
begin
try_text = max_tries > 1 ? "try #{try+1}/#{max_tries}: " : ''
cib = backup_cib()
Puppet.debug("#{try_text}#{PCS_BIN} -f #{cib} #{cmd}")
pcs_out = `#{PCS_BIN} -f #{cib} #{cmd} 2>&1`
if name.include?('show')
delete_cib(cib)
# return output for good exit or false for failure.
return $?.exitstatus == 0 ? pcs_out : false
end
if $?.exitstatus == 0
# If push_cib failed, we stay in the loop and keep trying
if push_cib(cib) == 0
sleep post_success_sleep
return pcs_out
end
end
Puppet.debug("Error: #{pcs_out}")
rescue Puppet::Error
Puppet.debug("cib_backup failed. Retrying #{try_text}")
end
Puppet.debug("Error: #{pcs_out}")
if try == max_tries-1
delete_cib(cib)
pcs_out_line = pcs_out.lines.first ? pcs_out.lines.first.chomp! : ''
raise Puppet::Error, "pcs -f #{cib} #{name} failed: #{pcs_out_line}"
# need to consider the case that pcs_out was always nil due to cib_backup() always failing
delete_cib(cib) if cib
if pcs_out == nil
pcs_out_line = ''
else
pcs_out_line = pcs_out.lines.first ? pcs_out.lines.first.chomp! : ''
end
raise Puppet::Error, "pcs -f #{cib} #{cmd} failed: #{pcs_out_line}. Too many tries"
end
if try_sleep > 0
Puppet.debug("Sleeping for #{try_sleep} seconds between tries")


Loading…
Cancel
Save