From 60b5e174db040366d49f7f82ed8e37e4671048cc Mon Sep 17 00:00:00 2001
From: Tobias Henkel
Date: Wed, 5 Jun 2019 21:45:00 +0200
Subject: [PATCH] Add spec for enhanced regional executor distribution

When running large distributed deployments it can be desirable to keep
traffic as local as possible. To facilitate this, Zuul supports zoning of
zuul-executors. Using zones, executors only process jobs on nodes that are
running in the same zone. This works well in many cases. However, there is
currently a limitation around live log streaming that makes it impossible
to use this feature in certain environments.

Live log streaming via zuul-web or zuul-fingergw requires that each
executor is directly addressable from zuul-web or zuul-fingergw. This is
not the case if:

* zuul-executors are behind a NAT. In this case one would need to create
  a NAT rule per executor on different ports, which can become a
  maintenance nightmare.

* zuul-executors run in a different Kubernetes or OpenShift cluster. In
  this case one would need an Ingress/Route or NodePort per executor,
  which also makes maintenance really hard.

Change-Id: Iecff1b05a93b34c2a80d18df21eafdcc46f6fdfa
---
 .../specs/enhanced-regional-executors.rst     | 137 ++++++++++++++++++
 .../reference/developer/specs/index.rst       |   1 +
 2 files changed, 138 insertions(+)
 create mode 100644 doc/source/reference/developer/specs/enhanced-regional-executors.rst

diff --git a/doc/source/reference/developer/specs/enhanced-regional-executors.rst b/doc/source/reference/developer/specs/enhanced-regional-executors.rst
new file mode 100644
index 0000000000..076de7a9c5
--- /dev/null
+++ b/doc/source/reference/developer/specs/enhanced-regional-executors.rst
@@ -0,0 +1,137 @@
+================================================
+Enhanced regional distribution of zuul-executors
+================================================
+
+.. warning:: This is not authoritative documentation. These features
+   are not currently available in Zuul. They may change significantly
+   before final implementation, or may never be fully completed.
+
+Problem description
+===================
+
+When running large distributed deployments it can be desirable to keep traffic
+as local as possible. To facilitate this, Zuul supports zoning of
+zuul-executors. Using zones, executors only process jobs on nodes that are
+running in the same zone. This works well in many cases. However, there is
+currently a limitation around live log streaming that makes it impossible to
+use this feature in certain environments.
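+
+For reference, a minimal sketch of how an executor is assigned to a zone
+today, assuming the existing ``zone`` option of the executor configuration
+(the zone name is an example value only):
+
+.. code-block:: ini
+
+   # zuul.conf on an executor in "zone-2"; this executor will only
+   # process builds whose nodes are in the same zone.
+   [executor]
+   zone = zone-2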
+
+Live log streaming via zuul-web or zuul-fingergw requires that each executor
+is directly addressable from zuul-web or zuul-fingergw. This is not the case
+if:
+
+* zuul-executors are behind a NAT. In this case one would need to create a
+  NAT rule per executor on different ports, which can become a maintenance
+  nightmare.
+
+* zuul-executors run in a different Kubernetes or OpenShift cluster. In this
+  case one would need an Ingress/Route or NodePort per executor, which also
+  makes maintenance really hard.
+
+Proposed change
+===============
+
+In both use cases it would be desirable to have one service in each zone that
+can further dispatch log streams within its own zone. Addressing a single
+service is much more feasible, e.g. via a single NAT rule or a Route or
+NodePort service in Kubernetes.
+
+.. graphviz::
+   :align: center
+
+   graph {
+      graph [fontsize=10 fontname="Verdana"];
+      node [fontsize=10 fontname="Verdana"];
+
+      user [ label="User" ];
+
+      subgraph cluster_1 {
+         node [style=filled];
+         label = "Zone 1";
+         web [ label="Web" ];
+         executor_1 [ label="Executor 1" ];
+      }
+
+      subgraph cluster_2 {
+         node [style=filled];
+         label = "Zone 2";
+         route [ label="Route/Ingress/NAT" ]
+         fingergw_zone2 [ label="Fingergw Zone 2"];
+         executor_2 [ label="Executor 2" ];
+         executor_3 [ label="Executor 3" ];
+      }
+
+      user -- web [ constraint=false ];
+
+      web -- executor_1
+
+      web -- route [ constraint=false ]
+      route -- fingergw_zone2
+      fingergw_zone2 -- executor_2
+      fingergw_zone2 -- executor_3
+   }
+
+Current log streaming is essentially the same for zuul-web and zuul-fingergw
+and works like this:
+
+* Fingergw gets a stream request from the user.
+* Fingergw resolves the stream address by calling get_job_log_stream_address
+  and supplying the build uuid.
+* Scheduler responds with the executor hostname and port on which the build
+  is running.
+* Fingergw connects to the stream address, supplies the build uuid and
+  connects the streams.
+
+The proposed process is almost the same:
+
+* Fingergw gets a stream request from the user.
+* Fingergw resolves the stream address by calling get_job_log_stream_address
+  and supplying the build uuid *and the zone of the fingergw (optional)*.
+* Scheduler responds with:
+
+  * the address of the executor if the zone provided with the request matches
+    the zone of the executor running the build, or the executor is un-zoned;
+  * the address of the fingergw in the target zone otherwise.
+
+* Fingergw connects to the stream address, supplies the build uuid and
+  connects the streams.
+
+In case the build runs in a different zone, the fingergw in the target zone
+will follow the exact same process and get the executor stream address, as
+this will be in the same zone.
+
+In order to facilitate this, the following changes need to be made:
+
+* The fingergw registers itself in the ZooKeeper component registry and
+  offers its hostname, port and optionally its zone. The hostname further
+  needs to be configurable like it is for the executors.
+
+* Zuul-web and fingergw need a new optional config parameter containing
+  their zone.
+
+While zuul-web and zuul-fingergw will be aware of which zone they are running
+in, end users will not need this information; the user-facing instances of
+those services will continue to serve the entirety of the Zuul system
+regardless of which zone they reside in, all from a single public URL or
+address.
+
+Gearman
+-------
+
+The easiest and most standard way of getting non-HTTP traffic into a
+Kubernetes/OpenShift cluster is using Ingress/Routes in combination with TLS
+and SNI (Server Name Indication). SNI is used in this case for dispatching
+the connection to the correct service. Gearman currently doesn't support SNI,
+which makes it harder to route it into a Kubernetes/OpenShift cluster from
+outside.
+
+Security considerations
+-----------------------
+
+Live log streams can potentially contain sensitive data. Especially when
+transferring them between different datacenters, encryption would be useful.
+We should therefore support optionally encrypting the finger streams using
+TLS with optional client auth, like we do with Gearman. The mechanism should
+also support SNI (Server Name Indication).
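+
+The following sketch is illustrative only and not part of the proposed
+implementation; it shows how a TLS-wrapped finger stream connection with SNI
+and optional client certificates could look using the Python ``ssl`` module
+(all names and paths are examples):
+
+.. code-block:: python
+
+   import socket
+   import ssl
+
+   def connect_stream(host, port, build_uuid, ca_cert,
+                      client_cert=None, client_key=None):
+       # Verify the server against the given CA bundle.
+       context = ssl.create_default_context(cafile=ca_cert)
+       if client_cert:
+           # Optional client auth, analogous to the existing Gearman
+           # TLS support.
+           context.load_cert_chain(client_cert, client_key)
+       sock = socket.create_connection((host, port))
+       # server_hostname enables SNI so that a fronting Route/Ingress
+       # can dispatch the connection to the fingergw of the right zone.
+       tls_sock = context.wrap_socket(sock, server_hostname=host)
+       # Supply the build uuid as in the existing finger protocol.
+       tls_sock.sendall(build_uuid.encode('utf-8') + b'\n')
+       return tls_sock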
+
diff --git a/doc/source/reference/developer/specs/index.rst b/doc/source/reference/developer/specs/index.rst
index 84a0647928..03669f2ce0
--- a/doc/source/reference/developer/specs/index.rst
+++ b/doc/source/reference/developer/specs/index.rst
@@ -21,3 +21,4 @@ documentation instead.
    circular-dependencies
    scale-out-scheduler
    zuul-runner
+   enhanced-regional-executors