fuel-library/deployment/puppet/pacemaker_wrappers
Commit bf604f80d7 by Bogdan Dobrelya: Restart rabbit if it can't list queues or finds a memory alert
Without this fix, a dead-end situation is possible: when a rabbit node
has no free memory left, the cluster blocks all publishing, by design.
The application, however, keeps waiting for the publish block to be
lifted and cannot recover.

The workaround is to monitor results
of crucial rabbitmqctl commands and restart the rabbit node,
if queues/channels/alarms cannot be listed or if there are
memory alarms found.
This is the similar logic as we have for the cases when
rabbitmqctl list_channels hangs. But the channels check is also
fixed to verify if the exit code>0 when the rabbit app is
running.
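
A minimal sketch of such a check, as it could appear in a shell-based
OCF monitor action (the helper name, timeouts, and the alarms grep are
illustrative and assume the standard OCF shell functions are sourced;
the real agent in this module may differ):

    # Hypothetical helper for the monitor action: report a failure to
    # pacemaker if crucial rabbitmqctl commands fail or a memory alarm
    # is raised, so the node gets restarted.
    check_rabbit_health() {
        local cmd_timeout=30

        # Queues or channels that cannot be listed while the app is
        # running indicate a wedged node.
        if ! timeout "${cmd_timeout}" rabbitmqctl -q list_queues name >/dev/null 2>&1; then
            ocf_log err "cannot list queues, scheduling rabbit restart"
            return "${OCF_ERR_GENERIC}"
        fi
        if ! timeout "${cmd_timeout}" rabbitmqctl -q list_channels >/dev/null 2>&1; then
            ocf_log err "cannot list channels, scheduling rabbit restart"
            return "${OCF_ERR_GENERIC}"
        fi

        # A raised memory alarm blocks publishing cluster-wide; treat it
        # as a failure too (assumes this RabbitMQ version reports a
        # non-empty alarms list in 'rabbitmqctl status' output).
        if timeout "${cmd_timeout}" rabbitmqctl status 2>/dev/null | grep -q 'alarms,\[[^]]'; then
            ocf_log err "memory alarm found, scheduling rabbit restart"
            return "${OCF_ERR_GENERIC}"
        fi

        return "${OCF_SUCCESS}"
    }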

The additional checks in the monitor also require extending the
monitor action's timeout from 60 to 180 seconds.
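
For illustration, on a pcs-managed cluster the extended timeout would
correspond to something like the following (the resource name and the
monitor interval are placeholders; Fuel actually applies this through
the Puppet manifests in this module):

    # Bump the monitor operation timeout from 60 to 180 seconds.
    pcs resource update p_rabbitmq-server op monitor interval=30 timeout=180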

Besides that, this patch makes the monitor action gather the rabbit
status and runtime stats: the memory consumed by all queues as a share
of total Mem+Swap, the total number of messages across all queues, and
the average queue consumer utilization. This info should make
troubleshooting failures easier.
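
Roughly, those stats could be computed as follows (a sketch only: it
assumes a RabbitMQ version whose list_queues exposes the memory,
messages and consumer_utilisation fields, and the real agent may
collect and format them differently):

    # Memory used by all queues as a share of Mem+Swap, total messages,
    # and average consumer utilisation across queues.
    total_kb=$(awk '/^(MemTotal|SwapTotal):/ {sum += $2} END {print sum}' /proc/meminfo)
    rabbitmqctl -q list_queues memory messages consumer_utilisation |
      awk -v total_kb="${total_kb}" '
        {
          mem += $1; msg += $2
          if ($3 != "") { util += $3; n++ }
        }
        END {
          printf "queues memory: %d bytes (%.1f%% of Mem+Swap)\n", mem, mem * 100 / (total_kb * 1024)
          printf "total messages: %d\n", msg
          if (n) printf "average consumer utilisation: %.2f\n", util / n
        }'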

DocImpact: ops guide. If any rabbitmq node exceeds its memory
threshold, publishing becomes blocked cluster-wide, by design. In that
case, the affected rabbit node is now recovered from the raised memory
alert and immediately stopped, to be restarted later by pacemaker.
Otherwise, the blocked publishing state might never be lifted if the
pressure from the OpenStack applications persists.
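
For reference, an operator can check whether a node has hit its
watermark and raised an alarm with standard rabbitmqctl commands (the
exact output format varies between RabbitMQ versions):

    # Configured memory limit and any alarms raised on this node.
    rabbitmqctl status | grep -E 'vm_memory_high_watermark|vm_memory_limit|alarms'
    # Cluster-wide view, including alarms raised on other nodes.
    rabbitmqctl cluster_status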

Closes-bug: #1463433

Change-Id: I91dec2d30d77b166ff9fe88109f3acdd19ce9ff9
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
2015-06-12 09:39:27 +02:00