Ceilometer compute `retry_on_disconnect` using `no-wait`

We discovered a problem on a production setup of Ceilometer compute
where metrics stopped being gathered. While troubleshooting, we found
the following error message.

```
ERROR ceilometer.polling.manager [-] Prevent pollster cpu from polling
```

That error message happened after the following message:

```
WARNING ceilometer.compute.pollsters [-] Cannot inspect data of
CPUPollster for <UUID>, non-fatal reason: Failed to inspect instance
<UUID> stats, can not get info from libvirt: Unable to read from
monitor: Connection reset by peer: NoDataException: Failed to inspect
instance <UUID> stats, can not get info from libvirt: Unable to read
from monitor: Connection reset by peer
```

The instance was running just fine on the host. It seems to have been
a concurrency issue with some other process that left the instance
locked/unavailable to the Ceilometer compute pollsters. Ceilometer was
unable to connect to libvirt (after 2 attempts), and the code is
designed to prevent Ceilometer from continuing to try. Therefore, the
"CPU" metric pollster was put into a permanent error state. To fix the
issue, we needed to restart Ceilometer on the affected hosts. However,
by the time we discovered the issue, we had already lost about 3 days
of data.

The failure originates in `inspect_instance`:

```
@libvirt_utils.raise_nodata_if_unsupported
@libvirt_utils.retry_on_disconnect
def inspect_instance(self, instance, duration=None):
    domain = self._get_domain_not_shut_off_or_raise(instance)
```

It tries to retrieve the domain (VM) object (XML description) via
libvirt. If that fails with a disconnection error, it retries via
`@libvirt_utils.retry_on_disconnect`; if the retries also fail, the
metric is put into a permanent error state via the
`@libvirt_utils.raise_nodata_if_unsupported` decorator.
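
To make the interaction concrete, here is a minimal, self-contained
sketch (with hypothetical stand-ins, not the actual Ceilometer/libvirt
code) of how the two decorators compose: once the back-to-back
attempts are exhausted, the error surfaces as a `NoDataException`,
which is what leads to the "Prevent pollster cpu from polling"
behavior seen in the logs.

```
import tenacity


class FakeLibvirtError(Exception):
    """Hypothetical stand-in for libvirt.libvirtError in this sketch."""


class NoDataException(Exception):
    """Simplified stand-in for ceilometer's NoDataException."""


def is_disconnection_exception(exc):
    # The real predicate inspects libvirt error codes; here every
    # FakeLibvirtError counts as a disconnection.
    return isinstance(exc, FakeLibvirtError)


# Same shape as the current decorator: two attempts, no wait between
# them.
retry_on_disconnect = tenacity.retry(
    retry=tenacity.retry_if_exception(is_disconnection_exception),
    stop=tenacity.stop_after_attempt(2))


def raise_nodata_if_unsupported(method):
    # Simplified: anything that survives the retries is converted into
    # a NoDataException, which the polling manager treats as a reason
    # to stop scheduling the pollster.
    def wrapper(*args, **kwargs):
        try:
            return method(*args, **kwargs)
        except Exception as exc:
            raise NoDataException(
                "Failed to inspect instance stats: %s" % exc)
    return wrapper


@raise_nodata_if_unsupported
@retry_on_disconnect
def inspect_instance(instance):
    # Simulate libvirt resetting the connection on every call: both
    # back-to-back attempts fail, and the caller only ever sees
    # NoDataException.
    raise FakeLibvirtError("Unable to read from monitor: "
                           "Connection reset by peer")
```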

Other metrics continued working. Therefore, I investigated a bit
deeper, and the problem seems to be here:

```
retry_on_disconnect = tenacity.retry(
    retry=tenacity.retry_if_exception(is_disconnection_exception),
    stop=tenacity.stop_after_attempt(2))
```

The `retry_on_disconnect` decorator does not configure a wait for the
"tenacity" retry library; the default is "no wait". Therefore, the
retries have a higher chance of being affected by very minor
instabilities (connection issues lasting only microseconds can exhaust
the retries with this configuration). One alternative to avoid such
problems in the future is to use a wait configuration such as the one
being proposed. Then, the Ceilometer compute pollsters would
wait/sleep before retrying, which would give the system some time to
become available again to the compute pollsters.
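
The effect can be illustrated with a small, self-contained sketch
(hypothetical names, not Ceilometer code) that simulates libvirt being
unreachable for a couple of seconds: with the current no-wait
configuration both attempts fall inside the outage window, while the
proposed back-off lets a later attempt succeed.

```
import time

import tenacity


class FakeDisconnect(Exception):
    """Hypothetical stand-in for a libvirt disconnection error."""


def is_disconnection_exception(exc):
    return isinstance(exc, FakeDisconnect)


_START = time.monotonic()


def inspect():
    # Simulate libvirt being unreachable for the first two seconds.
    if time.monotonic() - _START < 2.0:
        raise FakeDisconnect("Unable to read from monitor")
    return "stats"


# Current behaviour: two attempts with no pause between them, so both
# land inside the outage window and the call fails for good.
no_wait_inspect = tenacity.retry(
    retry=tenacity.retry_if_exception(is_disconnection_exception),
    stop=tenacity.stop_after_attempt(2))(inspect)

# Proposed behaviour: exponential back-off (capped at 60 s) between
# attempts, so a later attempt falls outside the outage window.
waiting_inspect = tenacity.retry(
    retry=tenacity.retry_if_exception(is_disconnection_exception),
    stop=tenacity.stop_after_attempt(3),
    wait=tenacity.wait_exponential(multiplier=3, min=1, max=60))(inspect)
```

Calling `no_wait_inspect()` right after start-up raises
`tenacity.RetryError`, while `waiting_inspect()` eventually returns
once the back-off carries it past the simulated outage.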

In this proposal, we would wait 2^x * 3 seconds between each retry,
with a minimum of 1 second and a maximum of 60 seconds.
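
As a rough illustration (a back-of-the-envelope approximation, not
tenacity's exact internal formula), the delays would grow along these
lines:

```
# Approximate back-off for
# tenacity.wait_exponential(multiplier=3, min=1, max=60);
# with stop_after_attempt(3) at most two waits actually happen.
multiplier, minimum, maximum = 3, 1, 60
for retry_number in range(1, 3):
    delay = min(max(multiplier * 2 ** (retry_number - 1), minimum),
                maximum)
    print("approximate wait before retry %d: ~%d s"
          % (retry_number, delay))
```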

Change-Id: I9a2d46f870dc2d2791a7763177773dc0cf8aed9d
Author: Rafael Weingärtner, 2021-04-08 09:43:36 -03:00
Commit: b664d4ea01 (parent: 122c55591f)
3 changed files with 4 additions and 3 deletions

```
@@ -108,7 +108,8 @@ def is_disconnection_exception(e):
 retry_on_disconnect = tenacity.retry(
     retry=tenacity.retry_if_exception(is_disconnection_exception),
-    stop=tenacity.stop_after_attempt(2))
+    stop=tenacity.stop_after_attempt(3),
+    wait=tenacity.wait_exponential(multiplier=3, min=1, max=60))
 def raise_nodata_if_unsupported(method):
```

```
@@ -41,7 +41,7 @@ requests-aws==0.1.4
 six==1.10.0
 stestr==2.0.0
 stevedore==3.0.0
-tenacity==4.12.0
+tenacity==6.3.1
 testresources==2.0.1
 testscenarios==0.4
 testtools==2.2.0
```

```
@@ -31,7 +31,7 @@ python-cinderclient>=3.3.0 # Apache-2.0
 PyYAML>=5.1 # MIT
 requests!=2.9.0,>=2.8.1 # Apache-2.0
 stevedore>=1.20.0 # Apache-2.0
-tenacity>=4.12.0 # Apache-2.0
+tenacity>=6.3.1,<7.0.0 # Apache-2.0
 tooz[zake]>=1.47.0 # Apache-2.0
 os-xenapi>=0.3.3 # Apache-2.0
 oslo.cache>=1.26.0 # Apache-2.0
```