Fix races in major-upgrade-pacemaker Step2
tripleo-heat-templates/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh
has the following code:
"""
...
check_resource mongod started 600

if [[ -n $(is_bootstrap_node) ]]; then
...
    tstart=$(date +%s)
    while ! clustercheck; do
        sleep 5
        tnow=$(date +%s)
        if (( tnow-tstart > galera_sync_timeout )) ; then
            echo_error "ERROR galera sync timed out"
            exit 1
        fi
    done

    # Run all the db syncs
    cinder-manage db sync
...
fi

start_or_enable_service rabbitmq
check_resource rabbitmq started 600
start_or_enable_service redis
check_resource redis started 600
start_or_enable_service openstack-cinder-volume
check_resource openstack-cinder-volume started 600

systemctl_swift start

for service in $(services_to_migrate); do
    manage_systemd_service start "${service%%-clone}"
    check_resource_systemd "${service%%-clone}" started 600
done
"""

The problem with the above code is that it is open to the following race
condition:
1) The bootstrap node is busy checking the galera status via clustercheck.
2) A non-bootstrap node has already reached "start_or_enable_service rabbitmq"
   and the later lines. It passes through these lines without waiting, because
   start_or_enable_service is a no-op on non-bootstrap nodes and
   "check_resource rabbitmq" only checks that "pcs status | grep rabbitmq"
   returns true.
3) The non-bootstrap node can then reach the manage_systemd_service start
   call, which fails with errors such as:
   "Job for openstack-nova-scheduler.service failed because the control
   process exited with error code. See \"systemctl status
   openstack-nova-scheduler.service\" and \"journalctl -xe\" for details.\n"
   (because the db tables are not migrated yet)

This happens because 3) was started on non-bootstrap nodes before the db-sync
statements had completed on the bootstrap node.

I did not feel like changing the semantics of check_resource and removing the
no-op on non-bootstrap nodes, as other parts of the tree might rely on this
behaviour.
Depends-On: Ia016264b51f485b97fa150ebd357b109581342ed
Change-Id: I663313e183bb05b35d0c5af016c2d1705c772bd9
Closes-Bug: #1627965
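The race described in the commit message can be reproduced in miniature. The sketch below uses hypothetical stand-ins: the NODE_ROLE variable and the simplified function bodies are assumptions made for illustration, not the real helpers from pacemaker_common_functions.sh. It only shows that a non-bootstrap node passes over both the db-sync block and the pacemaker start, so it reaches the systemd start straight away:

```shell
#!/bin/bash
# Illustration only: simplified, hypothetical stand-ins for the real helpers.
set -eu

# Assumption: NODE_ROLE says which node we pretend to be.
is_bootstrap_node() {
    if [[ "${NODE_ROLE:-}" == "bootstrap" ]]; then
        echo "true"
    fi
}

# Mirrors the behaviour the commit message describes:
# a no-op on every node except the bootstrap node.
start_or_enable_service() {
    if [[ -n $(is_bootstrap_node) ]]; then
        echo "pcs: start $1"
    fi
}

# The old step 2 flow, reduced to its three phases.
old_step2() {
    if [[ -n $(is_bootstrap_node) ]]; then
        echo "bootstrap: wait for galera, run db syncs"
    fi
    start_or_enable_service rabbitmq
    echo "systemd: start openstack-nova-scheduler"
}
```

Running `NODE_ROLE=controller old_step2` prints only the systemd line, whereas `NODE_ROLE=bootstrap old_step2` prints all three lines in order. Nothing in the non-bootstrap path waits for the bootstrap node, which is the window the race needs.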
parent 89efa79599
commit ad07a29f94
@@ -68,20 +68,3 @@ if [[ -n $(is_bootstrap_node) ]]; then
     #TODO(marios):someone from sahara needs to check this:
     # sahara-db-manage --config-file /etc/sahara/sahara.conf upgrade head
 fi
-
-start_or_enable_service rabbitmq
-check_resource rabbitmq started 600
-start_or_enable_service openstack-cinder-volume
-check_resource openstack-cinder-volume started 600
-
-
-# Swift isn't controled by pacemaker
-systemctl_swift start
-
-# We need to start the systemd services we explicitely stopped at step _1.sh
-# FIXME: Should we let puppet during the convergence step do the service enabling or
-# should we add it here?
-for service in $(services_to_migrate); do
-    manage_systemd_service start "${service%%-clone}"
-    check_resource_systemd "${service%%-clone}" started 600
-done
extraconfig/tasks/major_upgrade_controller_pacemaker_3.sh (executable file, 22 lines added)
@@ -0,0 +1,22 @@
+#!/bin/bash
+
+set -eu
+
+start_or_enable_service rabbitmq
+check_resource rabbitmq started 600
+start_or_enable_service redis
+check_resource redis started 600
+start_or_enable_service openstack-cinder-volume
+check_resource openstack-cinder-volume started 600
+
+
+# Swift isn't controled by pacemaker
+systemctl_swift start
+
+# We need to start the systemd services we explicitely stopped at step _1.sh
+# FIXME: Should we let puppet during the convergence step do the service enabling or
+# should we add it here?
+for service in $(services_to_migrate); do
+    manage_systemd_service start "${service%%-clone}"
+    check_resource_systemd "${service%%-clone}" started 600
+done
@@ -120,3 +120,22 @@ resources:
       config: {get_resource: ControllerPacemakerUpgradeConfig_Step2}
       input_values: {get_param: input_values}
 
+  ControllerPacemakerUpgradeConfig_Step3:
+    type: OS::Heat::SoftwareConfig
+    properties:
+      group: script
+      config:
+        list_join:
+        - ''
+        - - get_file: pacemaker_common_functions.sh
+          - get_file: major_upgrade_pacemaker_migrations.sh
+          - get_file: major_upgrade_controller_pacemaker_3.sh
+
+  ControllerPacemakerUpgradeDeployment_Step3:
+    type: OS::Heat::SoftwareDeploymentGroup
+    depends_on: ControllerPacemakerUpgradeDeployment_Step2
+    properties:
+      servers: {get_param: [servers, Controller]}
+      config: {get_resource: ControllerPacemakerUpgradeConfig_Step3}
+      input_values: {get_param: input_values}
+
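The depends_on line in the hunk above is what closes the race: Step3 is a separate deployment group that is ordered after Step2, so the service starts should not begin on any controller until Step2 (including the db syncs on the bootstrap node) has finished on all of them. A rough local analogy, assuming two simulated nodes run as background jobs and with `wait` standing in for the Heat dependency (the step2/step3 bodies and node names are hypothetical):

```shell
#!/bin/bash
# Analogy only: background jobs stand in for controllers, `wait` for depends_on.
set -eu

log=$(mktemp)

# Hypothetical step 2: $1 = node name, $2 = seconds of simulated db-sync work.
step2() {
    sleep "$2"
    echo "$1 step2 done" >> "$log"
}

# Hypothetical step 3: starts the services that need migrated tables.
step3() {
    echo "$1 step3 start" >> "$log"
}

run_upgrade() {
    : > "$log"
    step2 bootstrap 1 &       # slow: galera check plus db syncs
    step2 controller-1 0 &    # fast: nothing to do on this node
    wait                      # barrier: every node must finish step 2
    step3 bootstrap &
    step3 controller-1 &
    wait
    cat "$log"
}
```

Calling `run_upgrade` logs both "step2 done" lines before either "step3 start" line, even though controller-1 finishes its step 2 well before the bootstrap node does. Without the first `wait`, controller-1 could enter step 3 early, which corresponds to the failure seen when everything lived in the Step2 script.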