From c3a541061934c7446e1056e634e6eba4607152ee Mon Sep 17 00:00:00 2001
From: Mark Burnett
Date: Mon, 6 Nov 2017 16:03:03 -0600
Subject: [PATCH] Docs: Add design doc

This adds an initial description of Promenade's design.

Change-Id: I76060bcacf67ef2422c7d7514dcdc72fcd49d0f0
---
 README.md              |  10 +-
 docs/source/design.rst | 229 +++++++++++++++++++++++++++++++++++++++++
 docs/source/index.rst  |   1 +
 tox.ini                |   2 +-
 4 files changed, 238 insertions(+), 4 deletions(-)
 create mode 100644 docs/source/design.rst

diff --git a/README.md b/README.md
index b04370a3..c91338ec 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,9 @@
 # Promenade
 
 Promenade is a tool for bootstrapping a resilient Kubernetes cluster and
-managing its life-cycle.
+managing its life-cycle via Helm charts.
+
+Documentation can be found [here](https://promenade.readthedocs.io).
 
 ## Roadmap
@@ -21,9 +23,11 @@ The detailed Roadmap can be viewed on the
 
 ## Getting Started
 
-To get started, see [getting started](docs/getting-started.md).
+To get started, see
+[getting started](https://promenade.readthedocs.io/en/latest/getting-started.html).
 
-Configuration is documented [here](docs/configuration.md).
+Configuration is documented
+[here](https://promenade.readthedocs.io/en/latest/configuration/index.html).
 
 ## Bugs
diff --git a/docs/source/design.rst b/docs/source/design.rst
new file mode 100644
index 00000000..68845c92
--- /dev/null
+++ b/docs/source/design.rst
@@ -0,0 +1,229 @@
+Design
+======
+
+Promenade is a Kubernetes_ cluster deployment tool with the following goals:
+
+* Resiliency in the face of node loss and full cluster reboot.
+* Bare metal node support without external runtime dependencies.
+* Providing a fully functional single-node cluster to allow cluster-hosted
+  `tooling `_ to provision the
+  remaining cluster nodes.
+* Helm_ chart-managed component life-cycle.
+* API-managed cluster life-cycle.
+
+
+Cluster Bootstrapping
+---------------------
+
+The cluster is bootstrapped on a single node, called the genesis node. This
+node goes through a short-lived bootstrapping phase driven by static pod
+manifests consumed by ``kubelet``, then quickly moves to chart-managed
+infrastructure, driven by Armada_.
+
+During the bootstrapping phase, the following temporary components are run as
+static pods which are configured directly from Promenade's configuration
+documents:
+
+* Kubernetes_ core components
+
+  * ``apiserver``
+  * ``controller-manager``
+  * ``scheduler``
+
+* Etcd_ for use by the Kubernetes_ ``apiserver``
+* Helm_'s server process ``tiller``
+* CoreDNS_ to be used for Kubernetes_ ``apiserver`` discovery
+
+With these components up, it is possible to leverage Armada_ to deploy Helm_
+charts to manage these components (and additional components) going forward.
+
+Though completely configurable, a typical Armada_ manifest should specify
+charts for:
+
+* Kubernetes_ components
+
+  * ``apiserver``
+  * ``controller-manager``
+  * ``proxy``
+  * ``scheduler``
+
+* Cluster DNS (e.g. CoreDNS_)
+* Etcd_ for use by the Kubernetes_ ``apiserver``
+* A CNI_ provider for Kubernetes_ (e.g. Calico_)
+* An initial under-cloud system to allow cluster expansion, including
+  components like Armada_, Deckhand_, Drydock_ and Shipyard_.
+
+Once these charts are deployed, the cluster is validated (currently,
+validation is limited to resolving DNS queries and verifying basic Kubernetes
+functionality, including ``Pod`` scheduling and log collection), and then the
+genesis process is complete. Additional nodes can be added to the cluster
+using day 2 procedures.
+
+After additional master nodes are added to the cluster, it is possible to
+remove the genesis node from the cluster so that it can be fully
+re-provisioned using the same process as for all the other nodes.
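+The temporary components run during the bootstrapping phase are declared as
+ordinary Kubernetes_ ``Pod`` manifests placed on disk for ``kubelet`` to
+consume. As a rough sketch only -- the path, image, and flags below are
+illustrative assumptions, not Promenade's actual rendered output -- such a
+manifest might look like:
+
+.. code-block:: yaml
+
+    # Hypothetical static pod manifest, e.g.
+    # /etc/kubernetes/manifests/apiserver.yaml
+    apiVersion: v1
+    kind: Pod
+    metadata:
+      name: kubernetes-apiserver
+      namespace: kube-system
+    spec:
+      hostNetwork: true
+      containers:
+        - name: apiserver
+          image: gcr.io/google_containers/hyperkube-amd64:v1.8.0
+          command:
+            - /hyperkube
+            - apiserver
+            - --etcd-servers=https://127.0.0.1:2379
+            - --service-cluster-ip-range=10.96.0.0/16
+          volumeMounts:
+            - name: config
+              mountPath: /etc/kubernetes/apiserver
+              readOnly: true
+      volumes:
+        - name: config
+          hostPath:
+            path: /etc/kubernetes/apiserver
+
+Because ``kubelet`` reads such manifests directly from disk, these pods can
+start without any dependency on a functioning control plane.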
+
+
+Life-cycle Management
+---------------------
+
+There are two sets of resources that require life-cycle management: cluster
+nodes and Kubernetes_ control plane components. These two sets of resources
+are managed differently.
+
+
+Node Life-Cycle Management
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Node life-cycle management tools are provided via an API to be consumed by
+other tools like Drydock_ and Shipyard_.
+
+The life-cycle operations for nodes are:
+
+1. Adding a node to the cluster
+2. Removing a node from the cluster
+3. Adding and removing node labels
+
+
+Adding a node to the cluster
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Adding a node to the cluster is done by running a shell script on the node
+that installs the ``kubelet`` and configures it to find and join the cluster.
+This script can either be generated up front via the CLI, or it can be
+obtained via the ``join-scripts`` endpoint of the API (development of this
+API is in progress).
+
+Nodes can join the cluster only when all the required configuration documents
+are available, including the certificates needed by ``kubelet``.
+
+
+Removing a node from the cluster
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This is currently possible by leveraging the ``promenade-teardown`` script
+placed on each host. API support for this function is planned, but not yet
+implemented.
+
+
+Adding and removing node labels
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This is currently only possible directly via ``kubectl``, though API support
+for this functionality is planned.
+
+It is through relabeling nodes that key day 2 operations, such as moving a
+master node, are achieved.
+
+
+Control-Plane Component Life-Cycle Management
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+With the exception of the Docker_ daemon and the ``kubelet``, life-cycle
+management of control plane components is handled via Helm_ chart updates,
+which are orchestrated by Armada_.
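+For example, a chart-managed component is described to Armada_ by a chart
+document roughly like the sketch below (the schema names follow Armada_'s
+document format; the chart source, values, and names shown here are
+hypothetical):
+
+.. code-block:: yaml
+
+    schema: armada/Chart/v1
+    metadata:
+      schema: metadata/Document/v1
+      name: kubernetes-apiserver
+    data:
+      chart_name: apiserver
+      release: kubernetes-apiserver
+      namespace: kube-system
+      upgrade:
+        no_hooks: true
+      source:
+        # Illustrative source only.
+        type: git
+        location: https://github.com/att-comdev/promenade
+        subpath: charts/apiserver
+      values: {}
+
+Updating a component is then a matter of changing the chart (or its values)
+and re-applying the Armada_ manifest.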
+
+The Docker_ daemon is managed as an APT package, with configuration installed
+at the time the node is configured to join the cluster.
+
+The ``kubelet`` is directly installed and configured at the time nodes join
+the cluster. Work is in progress to improve the upgradability of ``kubelet``
+via either a system package or a chart.
+
+
+Resiliency
+----------
+
+The two primary failure scenarios Promenade is designed to be resilient
+against are node loss and full cluster restart.
+
+Kubernetes_ has a well-defined `High Availability
+`_ pattern, which deals well with node loss.
+
+However, this pattern requires an external load balancer for ``apiserver``
+discovery. Since it is a goal of this project for the cluster to be able to
+operate without ongoing external dependencies, we must avoid that
+requirement.
+
+Additionally, in the event of full cluster restart, we cannot rely on any
+response from the ``apiserver`` to give any ``kubelet`` direction on what
+processes to run. That means each master node must be self-sufficient, so
+that once a quorum of Etcd_ members is achieved the cluster may resume normal
+operation.
+
+The solution approach is two-pronged:
+
+1. Deploy a local discovery mechanism for the ``apiserver`` processes on each
+   node so that core components can always find the ``apiservers`` when their
+   nodes reboot.
+2. Apply the Anchor pattern described below to ensure that essential
+   components on master nodes restart even when the ``apiservers`` are not
+   available.
+
+Currently, the discovery mechanism for the ``apiserver`` processes is
+provided by CoreDNS_ via a zone file written to disk on each node. This
+approach has some drawbacks, which might be addressed in future work by
+leveraging HAProxy_ for discovery instead.
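+To make the discovery mechanism concrete, the zone file served locally by
+CoreDNS_ might contain records along these lines (the domain name and
+addresses below are hypothetical):
+
+.. code-block:: none
+
+    $ORIGIN promenade.local.
+    @           IN SOA ns.promenade.local. admin.promenade.local. (
+                    1 3600 600 86400 60 )
+    kubernetes  IN A 192.168.77.10
+    kubernetes  IN A 192.168.77.11
+    kubernetes  IN A 192.168.77.12
+
+Because each node resolves this zone locally, a name like
+``kubernetes.promenade.local`` resolves to the full set of ``apiserver``
+addresses even while the cluster itself is down.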
+
+
+Anchor Pattern
+^^^^^^^^^^^^^^
+
+The anchor pattern provides a way to manage process life-cycle using Helm_
+charts in a way that allows them to be restarted immediately in the event of
+a node restart -- even when the Kubernetes_ ``apiserver`` is unreachable.
+
+In this pattern, a ``DaemonSet`` called the ``anchor`` runs on selected nodes
+and is responsible for managing the life-cycle of assets deployed onto the
+node file system. In particular, these assets include a Kubernetes_ ``Pod``
+manifest to be consumed by ``kubelet``, which runs the processes specified by
+the ``Pod``. That management continues even when the node reboots, since
+static pods like this are run by the ``kubelet`` even when the ``apiserver``
+is not available.
+
+Cleanup of these resources is managed by the ``anchor`` pods' ``preStop``
+life-cycle hooks. This usually amounts to removing the files originally
+placed on the nodes' file systems but, e.g. in the case of Etcd_, can also be
+used to manage more complex cleanup, like removal from cluster membership.
+
+
+Alternatives
+------------
+
+* Kubeadm_
+
+  * Does not yet support
+    `HA `_
+  * Current approach to HA Etcd_ is to use the
+    `etcd operator `_, which
+    recovers from cluster reboot by loading from an external backup snapshot
+  * Does not support chart-based management of components
+
+* kops_
+
+  * Does not support `bare metal `_
+
+* Bootkube_
+
+  * Does not support automatic recovery from a
+    `full cluster reboot `_
+  * Does not yet support
+    `full HA `_
+  * Adheres to different design goals (minimal direct server contact), which
+    makes some of these changes challenging, e.g.
+    `building a self-contained, multi-master cluster `_
+  * Does not support chart-based management of components
+
+
+.. _Armada: https://github.com/att-comdev/armada
+.. _Bootkube: https://github.com/kubernetes-incubator/bootkube
+.. _CNI: https://github.com/containernetworking/cni
+.. _Calico: https://github.com/projectcalico/calico
+.. _CoreDNS: https://github.com/coredns/coredns
+.. _Deckhand: https://github.com/att-comdev/deckhand
+.. _Docker: https://www.docker.com
+.. _Drydock: https://github.com/att-comdev/drydock
+.. _Etcd: https://github.com/coreos/etcd
+.. _HAProxy: http://www.haproxy.org
+.. _Helm: https://github.com/kubernetes/helm
+.. _kops: https://github.com/kubernetes/kops
+.. _Kubeadm: https://github.com/kubernetes/kubeadm
+.. _Kubernetes: https://github.com/kubernetes/kubernetes
+.. _Shipyard: https://github.com/att-comdev/shipyard
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 6f2854e8..8a7dc672 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -30,6 +30,7 @@ Promenade Configuration Guide
 .. toctree::
    :maxdepth: 2
 
+   design
    getting-started
    configuration/index
    api
diff --git a/tox.ini b/tox.ini
index fd0edbbe..7f85cb16 100644
--- a/tox.ini
+++ b/tox.ini
@@ -1,5 +1,5 @@
 [tox]
-envlist = bandit,lint
+envlist = bandit,lint,docs
 
 [testenv]
 deps = -r{toxinidir}/requirements.txt