Retry calling backup_cib()

Currently if backup_cib() fails we just give up instantly. In most cases this is actually okay since there are no chances that the command succeeds in later retries. There are a number of cases though were retrying should be done. IHA FFU is one of them because the following might happen: 1) On the compute node we will try to set up a property when the remote is not yet connected Jul 21 04:59:45 compute-1 puppet-user[61009]: Debug: Executing: '/sbin/ip6tables-save' Jul 21 04:59:46 compute-1 pacemaker-remoted[42459]: warning: Cannot proxy IPC connection from uid 0 gid 0 to cib_rw because not connected to cluster Jul 21 04:59:46 compute-1 pacemaker-remoted[42459]: error: Error in connection setup (/dev/shm/qb-42459-61225-14-DzDuCR/qb): Remote I/O error (121) Jul 21 04:59:46 compute-1 puppet-user[61009]: Error: /Stage[main]/Tripleo::Profile::Pacemaker::Compute_instanceha/Pacemaker::Property[compute-instanceha-role-node-property]/Pcmk_property[property-compute-1-compute-instanceha-role]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20200721-61008-hz2sdw failed with code: 1 -> Error: unable to get cib 2) This happens because in IHA FFU after the control plane has been upgraded we upgrade the compute nodes one by one, without running any other commands on the control plane. So we need to keep retrying setting the property even if backup_cib() fails because eventually the core cluster *will* reconnect to the remote: Jul 21 05:01:22 compute-1 pacemaker-remoted[42459]: notice: Remote client connection accepted That is due to the fact that we cannot necessarily control at what point the core cluster retries the connection, so we should strive for retrying a few times no matter what. Tested this 4 times in a row successfully (before this patch we would fail quite often) Related-Bug: #1888398 Change-Id: I1b5c2ed35f83b6572db9e919a4af4cd46f0e98e8
2020-07-21 10:07:23 +02:00 · 2020-07-21 10:07:23 +02:00 · b7af3b6509
parent 1bb2fe70b1
commit b7af3b6509
1 changed files with 28 additions and 19 deletions
--- a/lib/puppet/provider/pcmk_common.rb
+++ b/lib/puppet/provider/pcmk_common.rb
@ -113,29 +113,38 @@ def pcs(name, resource_name, cmd, tries=1, try_sleep=0,
  if name.start_with?("create") && verify_on_create
    return pcs_create_with_verify(name, resource_name, cmd, tries, try_sleep)
  end
-  max_tries = name.include?('show') ? 1 : tries
+  max_tries = tries
  max_tries.times do |try|
-    cib = backup_cib()
-    try_text = max_tries > 1 ? "try #{try+1}/#{max_tries}: " : ''
-    Puppet.debug("#{try_text}#{PCS_BIN} -f #{cib} #{cmd}")
-    pcs_out = `#{PCS_BIN} -f #{cib} #{cmd} 2>&1`
-    if name.include?('show')
-      delete_cib(cib)
-      # return output for good exit or false for failure.
-      return $?.exitstatus == 0 ? pcs_out : false
-    end
-    if $?.exitstatus == 0
-      # If push_cib failed, we stay in the loop and keep trying
-      if push_cib(cib) == 0
-        sleep post_success_sleep
-        return pcs_out
+    begin
+      try_text = max_tries > 1 ? "try #{try+1}/#{max_tries}: " : ''
+      cib = backup_cib()
+      Puppet.debug("#{try_text}#{PCS_BIN} -f #{cib} #{cmd}")
+      pcs_out = `#{PCS_BIN} -f #{cib} #{cmd} 2>&1`
+      if name.include?('show')
+        delete_cib(cib)
+        # return output for good exit or false for failure.
+        return $?.exitstatus == 0 ? pcs_out : false
      end
+      if $?.exitstatus == 0
+        # If push_cib failed, we stay in the loop and keep trying
+        if push_cib(cib) == 0
+          sleep post_success_sleep
+          return pcs_out
+        end
+      end
+      Puppet.debug("Error: #{pcs_out}")
+    rescue Puppet::Error
+      Puppet.debug("cib_backup failed. Retrying #{try_text}")
    end
-    Puppet.debug("Error: #{pcs_out}")
    if try == max_tries-1
-      delete_cib(cib)
-      pcs_out_line = pcs_out.lines.first ? pcs_out.lines.first.chomp! : ''
-      raise Puppet::Error, "pcs -f #{cib} #{name} failed: #{pcs_out_line}"
+      # need to consider the case that pcs_out was always nil due to cib_backup() always failing
+      delete_cib(cib) if cib
+      if pcs_out == nil
+        pcs_out_line = ''
+      else
+        pcs_out_line = pcs_out.lines.first ? pcs_out.lines.first.chomp! : ''
+      end
+      raise Puppet::Error, "pcs -f #{cib} #{cmd} failed: #{pcs_out_line}. Too many tries"
    end
    if try_sleep > 0
      Puppet.debug("Sleeping for #{try_sleep} seconds between tries")