When a node running pacemaker is gracefully restarted, the following can happen:
1) the docker service gets stopped first
2) pacemaker gets shut down
3) pacemaker tries to shut down the bundles but fails because of 1)
Because stopping the services failed, one of two scenarios can follow
after the reboot:
A) The node gets fenced (when stonith is configured) because it failed to stop a resource.
B) The state of the resource gets saved as Stopped, and when the node
comes back up (if multiple nodes were rebooted at the same time) the CIB
may end up with Stopped as the target state for the resource.
In the case of B) we see something like the following:
Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.0.1:8787/rhosp12/openstack-rabbitmq-docker:pcmklatest]
   rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Stopped overcloud-controller-0
   rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Stopped overcloud-controller-0
   rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Stopped overcloud-controller-0
 Docker container set: galera-bundle [192.168.0.1:8787/rhosp12/openstack-mariadb-docker:pcmklatest]
   galera-bundle-0 (ocf::heartbeat:galera): Stopped overcloud-controller-0
   galera-bundle-1 (ocf::heartbeat:galera): Stopped overcloud-controller-0
   galera-bundle-2 (ocf::heartbeat:galera): Stopped overcloud-controller-0
 Docker container set: redis-bundle [192.168.0.1:8787/rhosp12/openstack-redis-docker:pcmklatest]
   redis-bundle-0 (ocf::heartbeat:redis): Stopped overcloud-controller-0
   redis-bundle-1 (ocf::heartbeat:redis): Stopped overcloud-controller-0
   redis-bundle-2 (ocf::heartbeat:redis): Stopped overcloud-controller-0
 ip-192.168.0.12 (ocf::heartbeat:IPaddr2): Stopped
 ip-10.19.184.160 (ocf::heartbeat:IPaddr2): Stopped
 ip-10.19.104.14 (ocf::heartbeat:IPaddr2): Stopped
 ip-10.19.104.19 (ocf::heartbeat:IPaddr2): Stopped
 ip-10.19.105.11 (ocf::heartbeat:IPaddr2): Stopped
 ip-192.168.200.15 (ocf::heartbeat:IPaddr2): Stopped
 Docker container set: haproxy-bundle [192.168.0.1:8787/rhosp12/openstack-haproxy-docker:pcmklatest]
   haproxy-bundle-docker-0 (ocf::heartbeat:docker): FAILED (blocked) [ overcloud-controller-0 overcloud-controller-2 overcloud-controller-1 ]
   haproxy-bundle-docker-1 (ocf::heartbeat:docker): FAILED (blocked) [ overcloud-controller-0 overcloud-controller-2 overcloud-controller-1 ]
   haproxy-bundle-docker-2 (ocf::heartbeat:docker): FAILED (blocked) [ overcloud-controller-0 overcloud-controller-2 overcloud-controller-1 ]
 openstack-cinder-volume (systemd:openstack-cinder-volume): Started overcloud-controller-0

Failed Actions:
* rabbitmq-bundle-docker-0_stop_0 on overcloud-controller-0 'unknown error' (1): call=93, status=Timed Out, exitreason='none',
    last-rc-change='Fri Nov 17 13:55:35 2017', queued=0ms, exec=20023ms
* rabbitmq-bundle-docker-1_stop_0 on overcloud-controller-0 'unknown error' (1): call=94, status=Timed Out, exitreason='none',
    last-rc-change='Fri Nov 17 13:55:35 2017', queued=0ms, exec=20037ms
* galera-bundle-docker-0_stop_0 on overcloud-controller-0 'unknown error' (1): call=96, status=Timed Out, exitreason='none',
    last-rc-change='Fri Nov 17 13:55:35 2017', queued=0ms, exec=20035ms
We fix this by adding the docker service to
resource-agents-deps.target.wants, which is the recommended way to
make sure that resources not managed by pacemaker come up before
pacemaker on start and get stopped only after pacemaker's service
stop:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_reference/s1-nonpacemakerstartup-haar

We apply this change only when docker is enabled, and only after the
docker package has been installed, so that split-stack deployments are
covered as well.
With this change we were able to restart nodes without observing any
stop timeouts or resources left Stopped at startup.
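
For reference, the systemd::unit_file resource in the profile below
boils down to a symlink in resource-agents-deps.target.wants plus a
systemd daemon-reload. A rough hand-rolled Puppet equivalent (the
resource titles are only illustrative and the paths assume the stock
docker unit location) would look like:

  # Directory that resource-agents-deps.target scans for Wants= links.
  file { '/etc/systemd/system/resource-agents-deps.target.wants':
    ensure => directory,
  }

  # The symlink makes resource-agents-deps.target want docker.service,
  # so docker is started before pacemaker and stopped only after it.
  file { '/etc/systemd/system/resource-agents-deps.target.wants/docker.service':
    ensure  => link,
    target  => '/usr/lib/systemd/system/docker.service',
    require => File['/etc/systemd/system/resource-agents-deps.target.wants'],
  }

  # systemd only picks up the new dependency after a daemon-reload.
  exec { 'resource-agents-deps-daemon-reload':
    command     => 'systemctl daemon-reload',
    path        => ['/usr/bin', '/bin'],
    refreshonly => true,
    subscribe   => File['/etc/systemd/system/resource-agents-deps.target.wants/docker.service'],
  }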
Co-Authored-By: Damien Ciabrini <dciabrin@redhat.com>
Change-Id: I6a4dc3d4d4818f15e9b7e68da3eb07e54b0289fa
Closes-Bug: #1733348
# Copyright 2016 Red Hat, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.
#
# == Class: tripleo::profile::base::pacemaker_remote
#
# Pacemaker remote profile for tripleo
#
# === Parameters
#
# [*remote_authkey*]
#   Authkey for pacemaker remote nodes
#   Defaults to unset
#
# [*pcs_tries*]
#   (Optional) The number of times pcs commands should be retried.
#   Defaults to hiera('pcs_tries', 20)
#
# [*enable_fencing*]
#   (Optional) Whether or not to manage stonith devices for nodes
#   Defaults to hiera('enable_fencing', false)
#
# [*step*]
#   (Optional) The current step in deployment. See tripleo-heat-templates
#   for more details.
#   Defaults to hiera('step')
#
class tripleo::profile::base::pacemaker_remote (
  $remote_authkey,
  $pcs_tries      = hiera('pcs_tries', 20),
  $enable_fencing = hiera('enable_fencing', false),
  $step           = Integer(hiera('step')),
) {
  class { '::pacemaker::remote':
    remote_authkey => $remote_authkey,
  }
  if str2bool(hiera('docker_enabled', false)) {
    include ::systemd::systemctl::daemon_reload

    Package<| name == 'docker' |>
    -> file { '/etc/systemd/system/resource-agents-deps.target.wants':
      ensure => directory,
    }
    -> systemd::unit_file { 'docker.service':
      path   => '/etc/systemd/system/resource-agents-deps.target.wants',
      target => '/usr/lib/systemd/system/docker.service',
      before => Class['pacemaker::remote'],
    }
    ~> Class['systemd::systemctl::daemon_reload']
  }
  $enable_fencing_real = str2bool($enable_fencing) and $step >= 5

  if $enable_fencing_real {
    include ::tripleo::fencing

    # enable stonith after all Pacemaker resources have been created
    Pcmk_resource<||> -> Class['tripleo::fencing']
    Pcmk_constraint<||> -> Class['tripleo::fencing']
    Exec <| tag == 'pacemaker_constraint' |> -> Class['tripleo::fencing']
  }
}
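
For illustration only, a hypothetical standalone declaration of this
profile showing how the parameters map; in a real TripleO deployment the
class is pulled in by the composable service templates, its parameters
come from hiera, and the authkey below is just a placeholder:

  class { 'tripleo::profile::base::pacemaker_remote':
    remote_authkey => 'placeholder-pacemaker-remote-authkey',
    pcs_tries      => 20,
    enable_fencing => false,
    step           => 5,
  }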