
Fix neutron-ha-tool for active/passive usage

The neutron-ha-tool Pacemaker resource primitive is only intended to be
run on a single node at a time, i.e. in active/passive mode, rather than
as a clone.  However, until now the RA didn't change behaviour depending
on whether it was supposed to be active on the current node.  So if
Pacemaker did a probe on a node where it was not expecting it to be
active, the monitor action would typically return OCF_SUCCESS, causing
messages from pengine like:

  error: Resource neutron-ha-tool (ocf::neutron-ha-tool) is active on 2 nodes attempting recovery
  warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.

and then Pacemaker could attempt unnecessary recovery according to the
value of the cluster-wide "multiple-active" option, which defaults to
"stop-start".  This would stop the resource everywhere (which is a
noop), and then start it on one node, resulting in unnecessary cluster
transitions and unnecessary runs of this RA's "start" action.

To avoid this, we introduce a state file to keep track of whether it's
active on the current node; if it is not, the monitor action skips the
l3-agent check and always returns OCF_NOT_RUNNING.  This is the same
technique already used by NovaEvacuate.
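
The state-file pattern described above can be sketched in isolation as
follows.  This is a minimal illustration rather than the RA itself: the
demo_* function names and the temporary directory are hypothetical
stand-ins for the real actions and for ${HA_RSCTMP}, and the OCF return
codes are hard-coded here, whereas a real agent would get them from
ocf-shellfuncs.

```shell
#!/bin/sh
# Sketch only: a real RA sources ocf-shellfuncs for these return codes.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical stand-ins for ${HA_RSCTMP} and ${OCF_RESOURCE_INSTANCE}.
state_dir=$(mktemp -d)
trap 'rm -rf "$state_dir"' EXIT
statefile="$state_dir/demo-resource.active"

demo_start() {
    # Record that this node is the active one.
    touch "$statefile" || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

demo_stop() {
    # Forget the active state; failing to do so is treated as an error.
    rm -f "$statefile"
    if [ -e "$statefile" ]; then
        return $OCF_ERR_GENERIC
    fi
    return $OCF_SUCCESS
}

demo_monitor() {
    # No state file: "start" never ran here, so report "not running"
    # without performing the real health check.
    if ! [ -e "$statefile" ]; then
        return $OCF_NOT_RUNNING
    fi
    # (the real health check, e.g. the l3-agent check, would go here)
    return $OCF_SUCCESS
}

# Walk through the life cycle as seen on a single node.
demo_monitor || echo "before start: rc=$?"
demo_start && demo_monitor && echo "after start: running"
demo_stop
demo_monitor || echo "after stop: rc=$?"
```

A probe on a passive node corresponds to the first demo_monitor call:
it returns OCF_NOT_RUNNING (7) because no state file exists there, so
the cluster has no reason to think the resource is active on two nodes.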

Change-Id: I459e49d27802552ef5424d290ef3fca51640723b
Closes-Bug: #1555711
Signed-off-by: Adam Spiers <aspiers@suse.com>
Adam Spiers 3 years ago
parent
commit
34447f8fa8
1 changed file with 35 additions and 1 deletion

ocf/neutron-ha-tool

@@ -192,6 +192,26 @@ neutron_ha_tool_status() {
 }
 
 neutron_ha_tool_monitor() {
+    if ! [ -e "$statefile" ]; then
+        # neutron-ha-tool is run on a single node at a time, i.e. in
+        # active/passive mode.  So we use this state file to keep
+        # track of whether it's active on the current node, and if
+        # Pacemaker does a probe on a node where it's not active, we
+        # skip the l3-agent check and always return OCF_NOT_RUNNING,
+        # otherwise we'd get messages from pengine like:
+        #
+        #   error: Resource neutron-ha-tool (ocf::neutron-ha-tool) is active on
+        #       2 nodes attempting recovery
+        #   warning: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active
+        #       for more information.
+        #
+        # and Pacemaker could attempt unnecessary recovery according to the
+        # value of the cluster-wide "multiple-active" option.
+        ocf_log debug "neutron-ha-tool not currently active on this node; " \
+            "skipping l3-agent check"
+        return $OCF_NOT_RUNNING
+    fi
+
     INSECURE=""
     if ocf_is_true $OCF_RESKEY_os_insecure; then
         INSECURE="--insecure"
@@ -210,6 +230,12 @@ neutron_ha_tool_monitor() {
 }
 
 neutron_ha_tool_start() {
+    touch "$statefile"
+    if ! [ -e "$statefile" ]; then
+        ocf_log err "Failed to create $statefile - aborting!"
+        return $OCF_ERR_GENERIC
+    fi
+
     INSECURE=""
     if ocf_is_true $OCF_RESKEY_os_insecure; then
         INSECURE="--insecure"
@@ -238,7 +264,13 @@ neutron_ha_tool_start() {
 }
 
 neutron_ha_tool_stop() {
-    # This is a noop
+    rm -f "$statefile"
+    if [ -e "$statefile" ]; then
+        ocf_log err "Uh-oh - failed to remove $statefile!"
+        # If we can't even remove a file in tmpfs (/run), something
+        # is *really* badly wrong, so fence the node.
+        return $OCF_ERR_GENERIC
+    fi
     return $OCF_SUCCESS
 }
 
@@ -268,6 +300,8 @@ if [ -n "$OCF_RESKEY_os_cacert" ]; then
     export OS_CACERT=$OCF_RESKEY_os_cacert
 fi
 
+statefile="${HA_RSCTMP}/${OCF_RESOURCE_INSTANCE}.active"
+
 # What kind of method was invoked?
 case "$1" in
     start)
