
Fix systemd service start rate limiting

By default, systemd allows 5 starts of a unit within a 10-second
interval. If a service exceeds that threshold because of the Restart=
option in its unit definition, systemd stops trying to restart it.

We should not disable rate limiting altogether by setting
StartLimitIntervalSec to 0, as that may increase the load on the node.

Instead, use tenacity to retry with an exponential backoff when
enabling the service unit fails. Before each retry, reset the unit's
failure counters with the systemctl wrapper. This is a crash-loop
approach that provides efficient feature parity with the classic
rate limiting, should we want to implement that for the systemctl
command wrapper instead.

Closes-bug: #1839841

Change-Id: I537fbf9933f2cbe6e1c2f627ba77da645bd55f25
Signed-off-by: Bogdan Dobrelya <bdobreli@redhat.com>
(cherry picked from commit 2eaebe2cd9)
tags/4.5.1
Bogdan Dobrelya 1 month ago
parent
commit
5b5e578cc5
1 changed file with 23 additions and 1 deletion
paunch/utils/systemctl.py

@@ -13,6 +13,7 @@
 # License for the specific language governing permissions and limitations
 # under the License.
 import subprocess
+import tenacity
 
 from paunch.utils import common
 
@@ -45,12 +46,33 @@ def daemon_reload(log=None):
     systemctl(['daemon-reload'], log)
 
 
+def reset_failed(service, log=None):
+    systemctl(['reset-failed', service], log)
+
+
+# NOTE(bogdando): this implements a crash-loop with reset-failed
+# counters approach that provides an efficient feature parity to the
+# classic rate limiting, shall we want to implement that for the
+# systemctl command wrapper instead.
+@tenacity.retry(  # Retry up to 5 times with jittered exponential backoff
+    reraise=True,
+    retry=tenacity.retry_if_exception_type(
+        SystemctlException
+    ),
+    wait=tenacity.wait_random_exponential(multiplier=1, max=10),
+    stop=tenacity.stop_after_attempt(5)
+)
 def enable(service, now=True, log=None):
     cmd = ['enable']
     if now:
         cmd.append('--now')
     cmd.append(service)
-    systemctl(cmd, log)
+    try:
+        systemctl(cmd, log)
+    except SystemctlException as err:
+        # Reset failure counters for the service unit and retry
+        reset_failed(service, log)
+        raise SystemctlException(str(err))
 
 
 def disable(service, log=None):
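
For readers unfamiliar with tenacity, the control flow of the decorated enable() can be approximated with the standard library alone. This is a minimal sketch, not the code from this commit: retry_with_reset, its parameters and the stand-in SystemctlException class are hypothetical names used only for illustration, and the wait computation is a generic jittered exponential backoff that may differ in detail from what wait_random_exponential does in a given tenacity release.

```python
import random
import time


class SystemctlException(Exception):
    """Stand-in for the exception raised by the systemctl wrapper."""


def retry_with_reset(action, reset, attempts=5, multiplier=1.0,
                     max_wait=10.0, sleep=time.sleep):
    """Call action(); on failure call reset(), back off, and try again.

    Hypothetical helper for illustration only. It approximates a jittered
    exponential backoff and is not tenacity's exact wait computation.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except SystemctlException:
            # Clear the unit's failure counters after every failed attempt,
            # as the except block in enable() does via reset_failed().
            reset()
            if attempt == attempts:
                raise  # Out of attempts: surface the last error.
            # The upper bound doubles on each attempt, capped at max_wait;
            # the actual wait is drawn uniformly from [0, upper].
            upper = min(max_wait, multiplier * 2 ** attempt)
            sleep(random.uniform(0, upper))
```

With the defaults above, a unit that keeps failing is attempted 5 times, its counters are reset after each failure, and the caller finally sees the last SystemctlException, which mirrors reraise=True combined with stop_after_attempt(5). Passing a custom sleep callable makes the helper easy to exercise without real delays.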
