..
  Copyright 2021 Canonical Ltd

  This work is licensed under a Creative Commons Attribution 3.0
  Unported License.
  http://creativecommons.org/licenses/by/3.0/legalcode

..
  This template should be in ReSTructured text. Please do not delete
  any of the sections in this template. If you have nothing to say
  for a whole section, just write: "None". For help with syntax, see
  http://sphinx-doc.org/rest.html To test out your formatting, see
  http://www.tele3.cz/jbar/rest/rest.html

======================================================================
Pausing Charms with subordinate hacluster without sending false alerts
======================================================================

Overall, the goal is to leave "warning" alerts instead of "critical" ones,
helping a human operator understand that not all services are completely
healthy while reducing the criticality caused by an on-going maintenance
operation. NRPE checks will be reconfigured once the services under a
maintenance operation are set back to normal (resume).

The following logic will be applied when pausing/resuming a unit:

- Pausing a principal unit pauses the subordinate hacluster;
- Resuming a principal unit resumes the subordinate hacluster;
- Pausing a hacluster unit pauses the principal unit;
- Resuming a hacluster unit resumes the principal unit.

Problem Description
===================

We need to stop sending false alerts when the hacluster subordinate of an
OpenStack charm unit is paused, or when the principal unit itself is paused
for maintenance. This should help operators receive more actionable alerts.

There are several charms that use hacluster and NRPE and may benefit from
this:

- charm-ceilometer
- charm-ceph-radosgw
- charm-designate
- charm-keystone
- charm-neutron-api
- charm-nova-cloud-controller
- charm-openstack-dashboard
- charm-cinder
- charm-glance
- charm-heat
- charm-swift-proxy

Pausing Principal Unit
----------------------

If, e.g., 3 keystone units (keystone/0, keystone/1 and keystone/2) are
deployed and keystone/0 is paused:

1) haproxy_servers on the other units (keystone/1 and keystone/2) will alert,
   because the apache2 service on keystone/0 is down

2) haproxy, apache2.service and memcached.service on keystone/0 will also
   alert

3) it's possible that corosync and pacemaker have the VIP placed on the same
   unit, at which point the service will fail as haproxy is disabled, so the
   hacluster subordinate unit should also be paused

Note: the services affected when pausing a principal unit may change
depending on the principal charm.

Pausing hacluster unit
----------------------

Pausing hacluster sets the cluster node, e.g. keystone, in standby mode.
A standby node will have its resources stopped (haproxy, apache2), which
will fire false alerts. To solve this issue, the hacluster units should
inform the keystone unit that they are paused. A way of doing this is
through the ha relation.

Proposed Change
===============

Pausing Principal Unit
----------------------

The pause action on a principal unit should share the event with its peers
to modify their behavior (until the resume action is triggered). It should
also share the status (paused/resumed) with the subordinate unit so that it
can catch up to the same status.

File actions.py in the principal unit:

.. code-block:: python

    def pause(args):
        pause_unit_helper(register_configs())

        # Logic added to share the event with peers
        inform_peers_if_ready(check_api_unit_ready)
        if is_nrpe_joined():
            update_nrpe_config()

        # Logic added to inform the hacluster subordinate that the unit
        # has been paused
        for r_id in relation_ids('ha'):
            relation_set(relation_id=r_id, paused=True)


    def resume(args):
        resume_unit_helper(register_configs())

        # Logic added to share the event with peers
        inform_peers_if_ready(check_api_unit_ready)
        if is_nrpe_joined():
            update_nrpe_config()

        # Logic added to inform the hacluster subordinate that the unit
        # has been resumed
        for r_id in relation_ids('ha'):
            relation_set(relation_id=r_id, paused=False)

After pausing a principal unit, it will change unit-state-{unit_name}
to NOTREADY. E.g.:

.. code-block:: yaml

    juju show-unit keystone/0 --endpoint cluster
    keystone/0:
      workload-version: 17.0.0
      machine: "1"
      opened-ports:
      - 5000/tcp
      public-address: 10.5.2.64
      charm: cs:~openstack-charmers-next/keystone-562
      leader: true
      relation-info:
      - endpoint: cluster
        related-endpoint: cluster
        application-data: {}
        local-unit:
          in-scope: true
          data:
            admin-address: 10.5.2.64
            egress-subnets: 10.5.2.64/32
            ingress-address: 10.5.2.64
            internal-address: 10.5.2.64
            private-address: 10.5.2.64
            public-address: 10.5.2.64
            unit-state-keystone-0: NOTREADY

Note: the unit-state-{unit_name} field is already implemented; the proposal
is simply to use this field, changing the value to NOTREADY when a unit is
paused and back to READY when it is resumed.

With every unit knowing which one is paused, it is possible to change the
check_haproxy.sh script to accept a flag naming the keystone units that are
paused. The current bash script is not able to receive flags.

check_haproxy.sh could be rewritten from Bash to Python so that it accepts a
flag to warn that a specific hostname (e.g. check_haproxy.py --warning
keystone-0) is under maintenance.
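
As an illustration, such a rewritten check could downgrade CRITICAL to
WARNING when every failing backend belongs to a paused unit. This is only a
hedged sketch: the --warning flag follows this spec, but the function names
and the stubbed server list are assumptions; a real check would read backend
states from the haproxy stats socket.

```python
# Hypothetical sketch of a check_haproxy.py; only the --warning flag comes
# from the spec. A real check would read backend states from the haproxy
# stats socket instead of taking them as an argument.
import argparse

OK, WARNING, CRITICAL = 0, 1, 2


def evaluate(servers, paused_units):
    """Map haproxy backend states to a Nagios exit code and message.

    servers: list of (server_name, is_up) tuples.
    paused_units: set of hostnames under maintenance, e.g. {'keystone-0'}.
    """
    down = [name for name, up in servers if not up]
    if not down:
        return OK, "OK: all haproxy servers up"
    if all(name in paused_units for name in down):
        # Every down server is on a paused unit: warn instead of page.
        return WARNING, "WARNING: paused unit(s) down: " + ", ".join(down)
    return CRITICAL, "CRITICAL: server(s) down: " + ", ".join(down)


def parse_paused(argv):
    """Parse e.g. ['--warning', 'keystone-0,keystone-2'] into a set."""
    parser = argparse.ArgumentParser(description="Check haproxy backends")
    parser.add_argument("--warning", default="",
                        help="comma-separated hostnames under maintenance")
    args = parser.parse_args(argv)
    return {h for h in args.warning.split(",") if h}
```

With keystone/0 paused, ``check_haproxy.py --warning keystone-0`` would then
exit 1 (WARNING) rather than 2 (CRITICAL) when only keystone-0's backend is
down.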

The file nrpe.py in charmhelpers/contrib/charmsupport should be changed to
first check whether there is any paused unit in the cluster and then add the
warning flag if necessary:

.. code-block:: python

    def add_haproxy_checks(nrpe, unit_name):
        """Add checks for each service in the list.

        :param NRPE nrpe: NRPE object to add check to
        :param str unit_name: Unit name to use in check description
        """
        cmd = "check_haproxy.py"

        peers_states = get_peers_unit_state()
        units_not_ready = [
            unit.replace('/', '-')
            for unit, state in peers_states.items()
            if state == UNIT_NOTREADY
        ]

        if is_unit_paused_set():
            units_not_ready.append(local_unit().replace('/', '-'))

        if units_not_ready:
            cmd += " --warning {}".format(','.join(units_not_ready))

        nrpe.add_check(
            shortname='haproxy_servers',
            description='Check HAProxy {%s}' % unit_name,
            check_cmd=cmd)
        nrpe.add_check(
            shortname='haproxy_queue',
            description='Check HAProxy queue depth {%s}' % unit_name,
            check_cmd='check_haproxy_queue_depth.sh')

When a principal unit changes its state, e.g. from READY to NOTREADY, it is
necessary to rewrite the NRPE files on the other principal units in the
cluster because, otherwise, they won't be able to warn that a unit is under
maintenance.

File responsible for hooks in the classic charms:

.. code-block:: python

    @hooks.hook('cluster-relation-changed')
    @restart_on_change(restart_map(), stopstart=True)
    def cluster_changed():
        # Logic added to update the NRPE config on all principal units
        # when a status changes
        update_nrpe_config()

Note: in reactive charms this might be slightly different, using handlers,
but the main idea is to call update_nrpe_config every time a config in the
cluster is changed. This will prevent false alerts on the other units in the
cluster.

Services from Principal Unit
----------------------------

Removing the .cfg files for those services from /etc/nagios/nrpe.d when the
unit is paused would stop the critical errors from being sent. The downside
of this approach is that there won't be user-friendly messages in Nagios
saying that the specific services (apache2, memcached, etc.) are under
maintenance; on the other hand, it's simpler to achieve.

File responsible for hooks in a classic charm:

.. code-block:: python

    @hooks.hook('nrpe-external-master-relation-joined',
                'nrpe-external-master-relation-changed')
    def update_nrpe_config():
        # logic before change
        # ...

        nrpe_setup = nrpe.NRPE(hostname=hostname)
        nrpe.copy_nrpe_checks()

        # Added logic to remove service checks while the unit is paused
        if is_unit_paused_set():
            nrpe.remove_init_service_checks(
                nrpe_setup,
                _services,
                current_unit
            )
        else:
            nrpe.add_init_service_checks(
                nrpe_setup,
                _services,
                current_unit
            )
        # end of added logic

        nrpe.add_haproxy_checks(nrpe_setup, current_unit)
        nrpe_setup.write()

The new logic to remove those service checks is presented below.

File charmhelpers/contrib/charmsupport/nrpe.py:

.. code-block:: python

    # Added logic to remove checks for apache2, memcached, etc.
    def remove_init_service_checks(nrpe, services, unit_name):
        for svc in services:
            if host.init_is_systemd(service_name=svc):
                nrpe.remove_check(
                    shortname=svc,
                    description='process check {%s}' % unit_name,
                    check_cmd='check_systemd.py %s' % svc
                )
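
In essence, removing a check amounts to deleting the generated .cfg file
from /etc/nagios/nrpe.d so that nrpe stops running it. A simplified
stand-in for that effect (the function name and the .cfg naming scheme here
are illustrative assumptions, not the charmhelpers API):

```python
# Simplified stand-in for the effect of removing an NRPE service check:
# drop its generated definition file so nrpe no longer runs the check.
# The function name and file naming scheme are illustrative only.
import os


def remove_service_check_cfg(nrpe_dir, shortname):
    """Delete the check definition for `shortname`; return True if removed."""
    cfg = os.path.join(nrpe_dir, "check_{}.cfg".format(shortname))
    if os.path.exists(cfg):
        os.remove(cfg)
        return True
    return False
```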

The status of the services will disappear from Nagios after a few minutes.
When the resume action is used, the services are initially restored as
PENDING, but after a few minutes the check completes.

Pausing hacluster unit
----------------------

File actions.py in charm-hacluster:

.. code-block:: python

    def pause(args):
        """Pause the hacluster services.

        @raises Exception should the service fail to stop.
        """
        pause_unit()
        # Logic added to inform the principal (e.g. keystone) that the
        # unit has been paused
        for r_id in relation_ids('ha'):
            relation_set(relation_id=r_id, paused=True)


    def resume(args):
        """Resume the hacluster services.

        @raises Exception should the service fail to start.
        """
        resume_unit()
        # Logic added to inform the principal that the unit has been resumed
        for r_id in relation_ids('ha'):
            relation_set(relation_id=r_id, paused=False)

Pausing a hacluster unit would result in sharing a new variable, paused,
that can be used by the principal units.

File responsible for hooks in a classic charm:

.. code-block:: python

    @hooks.hook('ha-relation-changed')
    @restart_on_change(restart_map(), restart_functions=restart_function_map())
    def ha_changed():
        # Added logic to pause the keystone unit when hacluster is paused
        for rid in relation_ids('ha'):
            for unit in related_units(rid):
                paused = relation_get('paused', rid=rid, unit=unit)
                clustered = relation_get('clustered', rid=rid, unit=unit)
                if clustered and is_db_ready():
                    if paused == 'True':
                        pause_unit_helper(register_configs())
                    elif paused == 'False':
                        resume_unit_helper(register_configs())

            update_nrpe_config()
            inform_peers_if_ready(check_api_unit_ready)
            # Inform the subordinate unit that this unit is paused or resumed
            relation_set(relation_id=rid, paused=is_unit_paused_set())

Informing the peers and updating the NRPE config is enough to trigger the
necessary logic to remove the service checks.

In a situation where the principal unit is paused, hacluster should also be
paused. For this to happen, it can use ha-relation-changed in
charm-hacluster:

.. code-block:: python

    @hooks.hook('ha-relation-joined',
                'ha-relation-changed',
                'peer-availability-relation-joined',
                'peer-availability-relation-changed',
                'pacemaker-remote-relation-changed')
    def ha_relation_changed():
        # Inserted logic: pause/resume if the principal unit is
        # paused/resumed
        paused = relation_get('paused')
        if paused == 'True':
            pause_unit()
        elif paused == 'False':
            resume_unit()

        # Share the subordinate unit's status
        for rel_id in relation_ids('ha'):
            relation_set(
                relation_id=rel_id,
                clustered="yes",
                paused=is_unit_paused_set()
            )
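
Note that Juju relation data is string-valued, which is why the hooks above
compare paused against the strings 'True' and 'False' rather than against
booleans. A minimal sketch of that round-trip, using in-memory stand-ins
for charmhelpers' relation_set/relation_get:

```python
# In-memory stand-ins for relation_set/relation_get, illustrating why the
# hooks compare against the strings 'True'/'False': Juju serialises every
# relation value to a string on the wire.
_bucket = {}


def relation_set(relation_id=None, **settings):
    for key, value in settings.items():
        _bucket[(relation_id, key)] = str(value)  # stringified on the wire


def relation_get(key, relation_id=None):
    return _bucket.get((relation_id, key))
```

After ``relation_set(relation_id='ha:0', paused=True)``, the peer's
``relation_get('paused', relation_id='ha:0')`` yields the string 'True',
not the boolean True.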

Alternatives
------------

One alternative to removing the service checks on the principal unit is to
change systemd.py in charm-nrpe to accept a -w flag like the one proposed
for check_haproxy.py.

This way it would not be necessary to remove the .cfg files for services
from the principal unit, but it would be necessary to adapt the function
`add_init_service_checks` so that it can accept services with the warning
flag.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  gabrielcocenza

Gerrit Topic
------------

Use Gerrit topic "pausing-charms-hacluster-no-false-alerts" for all patches
related to this spec.

.. code-block:: bash

    git-review -t pausing-charms-hacluster-no-false-alerts

Work Items
----------

- charmhelpers

  - nrpe.py
  - check_haproxy.py

- charm-ceilometer
- charm-ceph-radosgw
- charm-designate
- charm-keystone
- charm-neutron-api
- charm-nova-cloud-controller
- charm-openstack-dashboard
- charm-cinder
- charm-glance
- charm-heat
- charm-swift-proxy

- charm-nrpe (alternative)

  - systemd.py

- charm-hacluster

  - actions.py

Repositories
------------

No new git repository is required.

Documentation
-------------

It will be necessary to document the impact of pausing/resuming a
subordinate hacluster and the side effects on OpenStack API charms.

Security
--------

No additional security concerns.

Testing
-------

Code changes will be covered by unit and functional tests. The functional
tests will use a bundle with keystone, hacluster, nrpe and nagios.
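
A minimal functional-test bundle might look like the following sketch. This
is illustrative only: the series, options and relation endpoints are
placeholders, and a deployable bundle would also need keystone's database
backend.

```yaml
# Illustrative only: series/options are placeholders and keystone's
# database relation is omitted for brevity.
series: focal
applications:
  keystone:
    charm: keystone
    num_units: 3
    options:
      vip: 10.5.100.1
  keystone-hacluster:
    charm: hacluster
  nrpe:
    charm: nrpe
  nagios:
    charm: nagios
    num_units: 1
relations:
- [keystone:ha, keystone-hacluster:ha]
- [keystone:nrpe-external-master, nrpe:nrpe-external-master]
- [nrpe:monitors, nagios:monitors]
```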

Dependencies
============

None