Ceilometer compute `retry_on_disconnect` using `no-wait`

We discovered a problem on a production setup of Ceilometer compute
where metrics stopped being gathered. While troubleshooting, we found
the following error message.

```
ERROR ceilometer.polling.manager [-] Prevent pollster cpu from polling
```

That error message happened after the following message:

```
WARNING ceilometer.compute.pollsters [-] Cannot inspect data of
CPUPollster for <UUID>, non-fatal reason: Failed to inspect instance
<UUID> stats, can not get info from libvirt: Unable to read from
monitor: Connection reset by peer: NoDataException: Failed to inspect
instance <UUID> stats, can not get info from libvirt: Unable to read
from monitor: Connection reset by peer
```

The instance was running just fine on the host. It seems to have been
a concurrency issue with some other process that left the instance
locked/unavailable to the Ceilometer compute pollsters. Ceilometer was
unable to connect to libvirt (after 2 attempts), and the code is
designed to prevent Ceilometer from continuing to try. Therefore, the
"CPU" metric pollster was put into a permanent error state. To fix the
issue, we needed to restart Ceilometer on the affected hosts. However,
by the time we discovered the issue, we had already lost about 3 days
of data.

The failure originates in `inspect_instance`:

```
@libvirt_utils.raise_nodata_if_unsupported
@libvirt_utils.retry_on_disconnect
def inspect_instance(self, instance, duration=None):
    domain = self._get_domain_not_shut_off_or_raise(instance)
```

It tries to retrieve the domain (VM) object (XML description) via
libvirt. If that fails with a disconnection error, it retries via
`@libvirt_utils.retry_on_disconnect`; if the retries also fail, the
metric is put into a permanent error state via the
`@libvirt_utils.raise_nodata_if_unsupported` decorator.
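
To make the interaction concrete, here is a minimal, self-contained
sketch (with hypothetical stand-ins, not the actual Ceilometer/libvirt
code) of how the two decorators compose: once the back-to-back
attempts are exhausted, the error surfaces as a `NoDataException`,
which is what leads to the "Prevent pollster cpu from polling"
behavior seen in the logs.

```
import tenacity


class FakeLibvirtError(Exception):
    """Hypothetical stand-in for libvirt.libvirtError in this sketch."""


class NoDataException(Exception):
    """Simplified stand-in for ceilometer's NoDataException."""


def is_disconnection_exception(exc):
    # The real predicate inspects libvirt error codes; here every
    # FakeLibvirtError counts as a disconnection.
    return isinstance(exc, FakeLibvirtError)


# Same shape as the current decorator: two attempts, no wait between
# them.
retry_on_disconnect = tenacity.retry(
    retry=tenacity.retry_if_exception(is_disconnection_exception),
    stop=tenacity.stop_after_attempt(2))


def raise_nodata_if_unsupported(method):
    # Simplified: anything that survives the retries is converted into
    # a NoDataException, which the polling manager treats as a reason
    # to stop scheduling the pollster.
    def wrapper(*args, **kwargs):
        try:
            return method(*args, **kwargs)
        except Exception as exc:
            raise NoDataException(
                "Failed to inspect instance stats: %s" % exc)
    return wrapper


@raise_nodata_if_unsupported
@retry_on_disconnect
def inspect_instance(instance):
    # Simulate libvirt resetting the connection on every call: both
    # back-to-back attempts fail, and the caller only ever sees
    # NoDataException.
    raise FakeLibvirtError("Unable to read from monitor: "
                           "Connection reset by peer")
```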

Other metrics continued working. Therefore, I investigated a bit
deeper, and the problem seems to be here:

```
retry_on_disconnect = tenacity.retry(
    retry=tenacity.retry_if_exception(is_disconnection_exception),
    stop=tenacity.stop_after_attempt(2))
```

The `retry_on_disconnect` decorator does not configure a wait for the
"tenacity" retry library; the default is "no wait". Therefore, the
retries have a higher chance of being affected by very minor
instabilities (connection issues lasting only microseconds can exhaust
the retries with this configuration). One alternative to avoid such
problems in the future is to use a wait configuration such as the one
being proposed. Then, the Ceilometer compute pollsters would
wait/sleep before retrying, which would give the system some time to
become available again to the compute pollsters.
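
The effect can be illustrated with a small, self-contained sketch
(hypothetical names, not Ceilometer code) that simulates libvirt being
unreachable for a couple of seconds: with the current no-wait
configuration both attempts fall inside the outage window, while the
proposed back-off lets a later attempt succeed.

```
import time

import tenacity


class FakeDisconnect(Exception):
    """Hypothetical stand-in for a libvirt disconnection error."""


def is_disconnection_exception(exc):
    return isinstance(exc, FakeDisconnect)


_START = time.monotonic()


def inspect():
    # Simulate libvirt being unreachable for the first two seconds.
    if time.monotonic() - _START < 2.0:
        raise FakeDisconnect("Unable to read from monitor")
    return "stats"


# Current behaviour: two attempts with no pause between them, so both
# land inside the outage window and the call fails for good.
no_wait_inspect = tenacity.retry(
    retry=tenacity.retry_if_exception(is_disconnection_exception),
    stop=tenacity.stop_after_attempt(2))(inspect)

# Proposed behaviour: exponential back-off (capped at 60 s) between
# attempts, so a later attempt falls outside the outage window.
waiting_inspect = tenacity.retry(
    retry=tenacity.retry_if_exception(is_disconnection_exception),
    stop=tenacity.stop_after_attempt(3),
    wait=tenacity.wait_exponential(multiplier=3, min=1, max=60))(inspect)
```

Calling `no_wait_inspect()` right after start-up raises
`tenacity.RetryError`, while `waiting_inspect()` eventually returns
once the back-off carries it past the simulated outage.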

In this proposal, we would wait 2^x * 3 seconds between each retry,
with a minimum of 1 second and a maximum of 60 seconds.
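
As a rough illustration (a back-of-the-envelope approximation, not
tenacity's exact internal formula), the delays would grow along these
lines:

```
# Approximate back-off for
# tenacity.wait_exponential(multiplier=3, min=1, max=60);
# with stop_after_attempt(3) at most two waits actually happen.
multiplier, minimum, maximum = 3, 1, 60
for retry_number in range(1, 3):
    delay = min(max(multiplier * 2 ** (retry_number - 1), minimum),
                maximum)
    print("approximate wait before retry %d: ~%d s"
          % (retry_number, delay))
```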

Change-Id: I9a2d46f870dc2d2791a7763177773dc0cf8aed9d
Author: Rafael Weingärtner, 2021-04-08 09:43:36 -03:00
Commit: b664d4ea01 (parent: 122c55591f)
3 changed files with 4 additions and 3 deletions

```
@@ -108,7 +108,8 @@ def is_disconnection_exception(e):
 retry_on_disconnect = tenacity.retry(
     retry=tenacity.retry_if_exception(is_disconnection_exception),
-    stop=tenacity.stop_after_attempt(2))
+    stop=tenacity.stop_after_attempt(3),
+    wait=tenacity.wait_exponential(multiplier=3, min=1, max=60))
 def raise_nodata_if_unsupported(method):
```

```
@@ -41,7 +41,7 @@ requests-aws==0.1.4
 six==1.10.0
 stestr==2.0.0
 stevedore==3.0.0
-tenacity==4.12.0
+tenacity==6.3.1
 testresources==2.0.1
 testscenarios==0.4
 testtools==2.2.0
```

```
@@ -31,7 +31,7 @@ python-cinderclient>=3.3.0 # Apache-2.0
 PyYAML>=5.1 # MIT
 requests!=2.9.0,>=2.8.1 # Apache-2.0
 stevedore>=1.20.0 # Apache-2.0
-tenacity>=4.12.0 # Apache-2.0
+tenacity>=6.3.1,<7.0.0 # Apache-2.0
 tooz[zake]>=1.47.0 # Apache-2.0
 os-xenapi>=0.3.3 # Apache-2.0
 oslo.cache>=1.26.0 # Apache-2.0
```