Update patch set 2

Patch Set 2: (1 comment) Patch-set: 2
2019-05-07 17:22:24 +00:00 · 2019-05-07 17:22:24 +00:00 · 5556509539
parent 305706610f
commit 5556509539
1 changed files with 18 additions and 0 deletions
--- a/18
+++ b/18
@@ -34,6 +34,24 @@
      "revId": "a7aca4f0575666610526e24f31a41a4a7014e7be",
      "serverId": "4a232e18-c5a9-48ee-94c0-e04e7cca6543",
      "unresolved": false
+    },
+    {
+      "key": {
+        "uuid": "dfbec78f_2b3cfdc6",
+        "filename": "specs/rpc-health-checks.rst",
+        "patchSetId": 2
+      },
+      "lineNbr": 43,
+      "author": {
+        "id": 2394
+      },
+      "writtenOn": "2019-05-07T17:22:24Z",
+      "side": 1,
+      "message": "@Dan Firstly, apologies for not managing to grab you at the PTG for a quick hallway discussion on this.  Had too many other things to think about :-/\n\n \u003e That said, nova also already kinda does this in the form of service checkins. Meaning, if RPC is working, compute writes its liveness to the DB via conductor. The other services do the same, but directly to do the DB.\n\nThis is of course useful, but to me it doesn\u0027t feel like a complete solution to the problem of pin-pointing failures during automated root cause analysis or self-healing.  For example, if a service\u0027s liveness updates cease to appear in the DB, that doesn\u0027t give us an accurate read on where the failure is.  For example, it could be an issue with the DB or the network on the DB end, or with the network on the service\u0027s end, or an issue with the whole machine running the service.  All of these are of course significantly different failure modes to \"there\u0027s a problem with the service\".\n\nAdditionally, if another service such as conductor is proxying the liveness, then that adds another layer of failure modes which can potentially muddy the waters.\n\n \u003e To me, health checks should really be done as described previously, via supporting the normal http-based health checks on each service, which can then report things like whether or not RPC seems to be working, among other things.\n\nTo clarify, are you suggesting that RPC-only services which don\u0027t currently expose an HTTP endpoint should do so?  That\u0027s something the community has been considering for quite a while in the context of the health check APIs discussion:\n\n   https://storyboard.openstack.org/#!/story/2001439\n\nbut it has been a somewhat controversial suggestion, with some people concerned about potential bloat and impact on the security/deployment model.  
Having said that, Dirk and I did discuss it with Ben Nemec on Saturday just as the PTG finished, and he didn\u0027t seem to have violent objections. (FWIW personally I haven\u0027t made up my mind either way yet.)\n\nIf OTOH you are suggesting that the existing HTTP endpoints should act as proxies for RPC component health, then I\u0027m pretty sure that\u0027s not going to work for the reasons stated above.\n\nAnd even though we *do* potentially want to expand the existing HTTP endpoints to provide not just health-check data but also internal performance metrics (which was the main thrust of the Saturday discussion), this is out of scope for the specific use case we are trying to address here, i.e. allowing accurate liveness probes from k8s to RPC services.  In that use case, k8s doesn\u0027t need anything more than a simple binary dead-or-alive answer.\n\n \u003e Doing it in-band with RPC doesn\u0027t really tell us much that the existing service reporting isn\u0027t already telling us. If the rpc in-band check works, then likely the service checkin is working as well.\n\n\"likely\" is the key word here.  As mentioned above, for effective RCA via the likes of Vitrage, and effective management by the likes of k8s, we need checks to be as accurate as possible, otherwise we can\u0027t build self-healing workflows which we can trust to do the right thing.  For example, if the conductor dies, we can\u0027t afford our compute liveness probes to trigger false alarms, even momentarily.\n\n \u003e Further, the above-mentioned probe tool puts a non-nova component on the private nova rpc bus, which isn\u0027t a very good idea, IMHO.\n\nThis is an interesting point.  Any chance you could elaborate on your concerns here?  Of course an RPC liveness probe would have to act responsibly and not spam the bus, but are there other risks here, e.g. security-related ones?",
+      "parentUuid": "ffb9cba7_f9d68e27",
+      "revId": "a7aca4f0575666610526e24f31a41a4a7014e7be",
+      "serverId": "4a232e18-c5a9-48ee-94c0-e04e7cca6543",
+      "unresolved": false
    }
  ]
 }