From c3a541061934c7446e1056e634e6eba4607152ee Mon Sep 17 00:00:00 2001
From: Mark Burnett
Date: Mon, 6 Nov 2017 16:03:03 -0600
Subject: [PATCH] Docs: Add design doc

This adds an initial description of Promenade's design.

Change-Id: I76060bcacf67ef2422c7d7514dcdc72fcd49d0f0
---
 README.md              |  10 +-
 docs/source/design.rst | 229 +++++++++++++++++++++++++++++++++++++++++
 docs/source/index.rst  |   1 +
 tox.ini                |   2 +-
 4 files changed, 238 insertions(+), 4 deletions(-)
 create mode 100644 docs/source/design.rst

diff --git a/README.md b/README.md
index b04370a3..c91338ec 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,9 @@
 # Promenade
 
 Promenade is a tool for bootstrapping a resilient Kubernetes cluster and
-managing its life-cycle.
+managing its life-cycle via Helm charts.
+
+Documentation can be found [here](https://promenade.readthedocs.io).
 
 ## Roadmap
@@ -21,9 +23,11 @@ The detailed Roadmap can be viewed on the
 
 ## Getting Started
 
-To get started, see [getting started](docs/getting-started.md).
+To get started, see
+[getting started](https://promenade.readthedocs.io/en/latest/getting-started.html).
 
-Configuration is documented [here](docs/configuration.md).
+Configuration is documented
+[here](https://promenade.readthedocs.io/en/latest/configuration/index.html).
 
 ## Bugs
diff --git a/docs/source/design.rst b/docs/source/design.rst
new file mode 100644
index 00000000..68845c92
--- /dev/null
+++ b/docs/source/design.rst
@@ -0,0 +1,229 @@
+Design
+======
+
+Promenade is a Kubernetes_ cluster deployment tool with the following goals:
+
+* Resiliency in the face of node loss and full cluster reboot.
+* Bare metal node support without external runtime dependencies.
+* Providing a fully functional single-node cluster to allow cluster-hosted
+  `tooling `_ to provision the
+  remaining cluster nodes.
+* Helm_ chart-managed component life-cycle.
+* API-managed cluster life-cycle.
+
+
+Cluster Bootstrapping
+---------------------
+
+The cluster is bootstrapped on a single node, called the genesis node. This
+node goes through a short-lived bootstrapping phase driven by static pod
+manifests consumed by ``kubelet``, then quickly moves to chart-managed
+infrastructure, driven by Armada_.
+
+During the bootstrapping phase, the following temporary components are run as
+static pods which are configured directly from Promenade's configuration
+documents:
+
+* Kubernetes_ core components
+
+  * ``apiserver``
+  * ``controller-manager``
+  * ``scheduler``
+
+* Etcd_ for use by the Kubernetes_ ``apiserver``
+* Helm_'s server process ``tiller``
+* CoreDNS_ to be used for Kubernetes_ ``apiserver`` discovery
+
+With these components up, it is possible to leverage Armada_ to deploy Helm_
+charts to manage these components (and additional components) going forward.
+
+Though completely configurable, a typical Armada_ manifest should specify
+charts for:
+
+* Kubernetes_ components
+
+  * ``apiserver``
+  * ``controller-manager``
+  * ``proxy``
+  * ``scheduler``
+
+* Cluster DNS (e.g. CoreDNS_)
+* Etcd_ for use by the Kubernetes_ ``apiserver``
+* A CNI_ provider for Kubernetes_ (e.g. Calico_)
+* An initial under-cloud system to allow cluster expansion, including
+  components like Armada_, Deckhand_, Drydock_ and Shipyard_.
+
+Once these charts are deployed, the cluster is validated (currently,
+validation is limited to resolving DNS queries and verifying basic Kubernetes
+functionality, including ``Pod`` scheduling and log collection), and then the
+genesis process is complete. Additional nodes can be added to the cluster
+using day 2 procedures.
+
+After additional master nodes are added to the cluster, it is possible to
+remove the genesis node from the cluster so that it can be fully
+re-provisioned using the same process as for all the other nodes.
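+The temporary components run during the bootstrapping phase are declared as
+ordinary Kubernetes_ ``Pod`` manifests placed on disk for ``kubelet`` to
+consume. As a rough sketch only -- the path, image, and flags below are
+illustrative assumptions, not Promenade's actual rendered output -- such a
+manifest might look like:
+
+.. code-block:: yaml
+
+    # Hypothetical static pod manifest, e.g.
+    # /etc/kubernetes/manifests/apiserver.yaml
+    apiVersion: v1
+    kind: Pod
+    metadata:
+      name: kubernetes-apiserver
+      namespace: kube-system
+    spec:
+      hostNetwork: true
+      containers:
+        - name: apiserver
+          image: gcr.io/google_containers/hyperkube-amd64:v1.8.0
+          command:
+            - /hyperkube
+            - apiserver
+            - --etcd-servers=https://127.0.0.1:2379
+            - --service-cluster-ip-range=10.96.0.0/16
+          volumeMounts:
+            - name: config
+              mountPath: /etc/kubernetes/apiserver
+              readOnly: true
+      volumes:
+        - name: config
+          hostPath:
+            path: /etc/kubernetes/apiserver
+
+Because ``kubelet`` reads such manifests directly from disk, these pods can
+start without any dependency on a functioning control plane.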
+
+
+Life-cycle Management
+---------------------
+
+There are two sets of resources that require life-cycle management: cluster
+nodes and Kubernetes_ control plane components. These two sets of resources
+are managed differently.
+
+
+Node Life-Cycle Management
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Node life-cycle management tools are provided via an API to be consumed by
+other tools like Drydock_ and Shipyard_.
+
+The life-cycle operations for nodes are:
+
+1. Adding a node to the cluster
+2. Removing a node from the cluster
+3. Adding and removing node labels
+
+
+Adding a node to the cluster
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Adding a node to the cluster is done by running a shell script on the node
+that installs the ``kubelet`` and configures it to find and join the cluster.
+This script can either be generated up front via the CLI, or it can be
+obtained via the ``join-scripts`` endpoint of the API (development of this
+API is in progress).
+
+Nodes can join the cluster only when all the required configuration documents
+are available, including the certificates needed by ``kubelet``.
+
+
+Removing a node from the cluster
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This is currently possible by leveraging the ``promenade-teardown`` script
+placed on each host. API support for this function is planned, but not yet
+implemented.
+
+
+Adding and removing node labels
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This is currently only possible directly via ``kubectl``, though API support
+for this functionality is planned.
+
+It is through relabeling nodes that key day 2 operations, such as moving a
+master node, are achieved.
+
+
+Control-Plane Component Life-Cycle Management
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+With the exception of the Docker_ daemon and the ``kubelet``, life-cycle
+management of control plane components is handled via Helm_ chart updates,
+which are orchestrated by Armada_.
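+For example, a chart-managed component is described to Armada_ by a chart
+document roughly like the sketch below (the schema names follow Armada_'s
+document format; the chart source, values, and names shown here are
+hypothetical):
+
+.. code-block:: yaml
+
+    schema: armada/Chart/v1
+    metadata:
+      schema: metadata/Document/v1
+      name: kubernetes-apiserver
+    data:
+      chart_name: apiserver
+      release: kubernetes-apiserver
+      namespace: kube-system
+      upgrade:
+        no_hooks: true
+      source:
+        # Illustrative source only.
+        type: git
+        location: https://github.com/att-comdev/promenade
+        subpath: charts/apiserver
+      values: {}
+
+Updating a component is then a matter of changing the chart (or its values)
+and re-applying the Armada_ manifest.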
+
+The Docker_ daemon is managed as an APT package, with configuration installed
+at the time the node is configured to join the cluster.
+
+The ``kubelet`` is directly installed and configured at the time nodes join
+the cluster. Work is in progress to improve the upgradability of ``kubelet``
+via either a system package or a chart.
+
+
+Resiliency
+----------
+
+The two primary failure scenarios Promenade is designed to be resilient
+against are node loss and full cluster restart.
+
+Kubernetes_ has a well-defined `High Availability
+`_ pattern, which deals well with node loss.
+
+However, this pattern requires an external load balancer for ``apiserver``
+discovery. Since it is a goal of this project for the cluster to be able to
+operate without ongoing external dependencies, we must avoid that
+requirement.
+
+Additionally, in the event of full cluster restart, we cannot rely on any
+response from the ``apiserver`` to give any ``kubelet`` direction on what
+processes to run. That means each master node must be self-sufficient, so
+that once a quorum of Etcd_ members is achieved the cluster may resume normal
+operation.
+
+The solution approach is two-pronged:
+
+1. Deploy a local discovery mechanism for the ``apiserver`` processes on each
+   node so that core components can always find the ``apiservers`` when their
+   nodes reboot.
+2. Apply the Anchor pattern described below to ensure that essential
+   components on master nodes restart even when the ``apiservers`` are not
+   available.
+
+Currently, the discovery mechanism for the ``apiserver`` processes is
+provided by CoreDNS_ via a zone file written to disk on each node. This
+approach has some drawbacks, which might be addressed in future work by
+leveraging HAProxy_ for discovery instead.
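+To make the discovery mechanism concrete, the zone file served locally by
+CoreDNS_ might contain records along these lines (the domain name and
+addresses below are hypothetical):
+
+.. code-block:: none
+
+    $ORIGIN promenade.local.
+    @           IN SOA ns.promenade.local. admin.promenade.local. (
+                    1 3600 600 86400 60 )
+    kubernetes  IN A 192.168.77.10
+    kubernetes  IN A 192.168.77.11
+    kubernetes  IN A 192.168.77.12
+
+Because each node resolves this zone locally, a name like
+``kubernetes.promenade.local`` resolves to the full set of ``apiserver``
+addresses even while the cluster itself is down.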
+
+
+Anchor Pattern
+^^^^^^^^^^^^^^
+
+The anchor pattern provides a way to manage process life-cycle using Helm_
+charts in a way that allows them to be restarted immediately in the event of
+a node restart -- even when the Kubernetes_ ``apiserver`` is unreachable.
+
+In this pattern, a ``DaemonSet`` called the ``anchor`` runs on selected nodes
+and is responsible for managing the life-cycle of assets deployed onto the
+node file system. In particular, these assets include a Kubernetes_ ``Pod``
+manifest to be consumed by ``kubelet``, which runs the processes specified by
+the ``Pod``. That management continues even when the node reboots, since
+static pods like this are run by the ``kubelet`` even when the ``apiserver``
+is not available.
+
+Cleanup of these resources is managed by the ``anchor`` pods' ``preStop``
+life-cycle hooks. This usually amounts to removing the files originally
+placed on the nodes' file systems but, e.g. in the case of Etcd_, can also be
+used to manage more complex cleanup, like removal from cluster membership.
+
+
+Alternatives
+------------
+
+* Kubeadm_
+
+  * Does not yet support
+    `HA `_
+  * Current approach to HA Etcd_ is to use the
+    `etcd operator `_, which
+    recovers from cluster reboot by loading from an external backup snapshot
+  * Does not support chart-based management of components
+
+* kops_
+
+  * Does not support `bare metal `_
+
+* Bootkube_
+
+  * Does not support automatic recovery from a
+    `full cluster reboot `_
+  * Does not yet support
+    `full HA `_
+  * Adheres to different design goals (minimal direct server contact), which
+    makes some of these changes challenging, e.g.
+    `building a self-contained, multi-master cluster `_
+  * Does not support chart-based management of components
+
+
+.. _Armada: https://github.com/att-comdev/armada
+.. _Bootkube: https://github.com/kubernetes-incubator/bootkube
+.. _CNI: https://github.com/containernetworking/cni
+.. _Calico: https://github.com/projectcalico/calico
+.. _CoreDNS: https://github.com/coredns/coredns
+.. _Deckhand: https://github.com/att-comdev/deckhand
+.. _Docker: https://www.docker.com
+.. _Drydock: https://github.com/att-comdev/drydock
+.. _Etcd: https://github.com/coreos/etcd
+.. _HAProxy: http://www.haproxy.org
+.. _Helm: https://github.com/kubernetes/helm
+.. _kops: https://github.com/kubernetes/kops
+.. _Kubeadm: https://github.com/kubernetes/kubeadm
+.. _Kubernetes: https://github.com/kubernetes/kubernetes
+.. _Shipyard: https://github.com/att-comdev/shipyard
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 6f2854e8..8a7dc672 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -30,6 +30,7 @@ Promenade Configuration Guide
 .. toctree::
    :maxdepth: 2
 
+   design
    getting-started
    configuration/index
    api
diff --git a/tox.ini b/tox.ini
index fd0edbbe..7f85cb16 100644
--- a/tox.ini
+++ b/tox.ini
@@ -1,5 +1,5 @@
 [tox]
-envlist = bandit,lint
+envlist = bandit,lint,docs
 
 [testenv]
 deps = -r{toxinidir}/requirements.txt