Design
Promenade is a Kubernetes cluster deployment tool with the following goals:
- Resiliency in the face of node loss and full cluster reboot.
- Bare metal node support without external runtime dependencies.
- Providing a fully functional single-node cluster to allow cluster-hosted tooling to provision the remaining cluster nodes.
- Helm chart managed component life-cycle.
- API-managed cluster life-cycle.
Cluster Bootstrapping
The cluster is bootstrapped on a single node, called the genesis node. This node goes through a short-lived bootstrapping phase driven by static pod manifests consumed by kubelet, then quickly moves to chart-managed infrastructure, driven by Armada.
During the bootstrapping phase, the following temporary components are run as static pods which are configured directly from Promenade's configuration documents:
- Kubernetes core components
  - apiserver
  - controller-manager
  - scheduler
- Etcd for use by the Kubernetes apiserver
- Helm's server process, tiller
- CoreDNS to be used for Kubernetes apiserver discovery
With these components up, it is possible to leverage Armada to deploy Helm charts to manage these components (and additional components) going forward.
Though completely configurable, a typical Armada manifest should specify charts for:
- Kubernetes components
  - apiserver
  - controller-manager
  - proxy
  - scheduler
- Cluster DNS (e.g. CoreDNS)
- Etcd for use by the Kubernetes apiserver
- A CNI provider for Kubernetes (e.g. Calico)
- An initial under-cloud system to allow cluster expansion, including components like Armada, Deckhand, Drydock and Shipyard.
Once these charts are deployed, the cluster is validated (currently, validation is limited to resolving DNS queries and verifying basic Kubernetes functionality, including Pod scheduling and log collection), and then the genesis process is complete. Additional nodes can be added to the cluster using day 2 procedures.
After additional master nodes are added to the cluster, it is possible to remove the genesis node from the cluster so that it can be fully re-provisioned using the same process as for all the other nodes.
Life-cycle Management
There are two sets of resources that require life-cycle management: cluster nodes and Kubernetes control plane components. These two sets of resources are managed differently.
Node Life-Cycle Management
Node life-cycle management tools are provided via an API to be consumed by other tools like Drydock and Shipyard.
The life-cycle operations for nodes are:
- Adding a node to the cluster
- Removing a node from the cluster
- Adding and removing node labels
Adding a node to the cluster
Adding a node to the cluster is done by running a shell script on the node that installs the kubelet and configures it to find and join the cluster. This script can either be generated up front via the CLI, or it can be obtained via the join-scripts endpoint of the API (development of this API is in progress).
Nodes can only be joined if all the proper configuration documents are available, including the required certificates for the kubelet.
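As an illustration of the API route, the sketch below builds a request URL for the join-scripts endpoint. The base URL, the path prefix and the query parameter names (hostname, ip, design_ref) are assumptions made for this example only, since the API is still in development; the api-ref is the place to check the actual contract.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_join_script_url(base_url, hostname, ip, design_ref):
    """Build a URL for requesting a node's join script.

    The "/join-scripts" path suffix and all parameter names are
    hypothetical placeholders, not a confirmed API contract.
    """
    query = urlencode({"hostname": hostname, "ip": ip, "design_ref": design_ref})
    return f"{base_url.rstrip('/')}/join-scripts?{query}"


# Example values are invented for the sketch.
url = build_join_script_url(
    "http://promenade.example/api/v1.0",
    hostname="n1",
    ip="192.168.77.11",
    design_ref="http://config.example/design.yaml",
)
print(url)
```

urlencode percent-encodes the design reference, so a URL can be passed safely as a query value.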
Removing a node from the cluster
This is currently possible by leveraging the promenade-teardown script placed on each host. API support for this function is planned, but not yet implemented.
Adding and removing node labels
Promenade provides a node-labels API for updating node labels. For more information about updating node labels, please reference the api-ref.
It is through relabeling nodes that key day 2 operations functionality, like moving a master node, is achieved.
Control-Plane Component Life-Cycle Management
With the exception of the Docker daemon and the kubelet, life-cycle management of control plane components is handled via Helm chart updates, which are orchestrated by Armada.
The Docker daemon is managed as an APT package, with configuration installed at the time the node is configured to join the cluster.
The kubelet is directly installed and configured at the time nodes join the cluster. Work is in progress to improve the upgradability of kubelet via either a system package or a chart.
Resiliency
The two primary failure scenarios Promenade is designed to be resilient against are node loss and full cluster restart.
Kubernetes has a well-defined High Availability pattern, which deals well with node loss.
However, this pattern requires an external load balancer for apiserver discovery. Since it is a goal of this project for the cluster to be able to operate without ongoing external dependencies, we must avoid that requirement.
Additionally, in the event of full cluster restart, we cannot rely on any response from the apiserver to give any kubelet direction on what processes to run. That means each master node must be self-sufficient, so that once a quorum of Etcd members is achieved the cluster may resume normal operation.
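The quorum condition mentioned above can be made concrete with a small sketch. It assumes the usual majority rule for Etcd (more than half of the members must be available); the function names are illustrative and not part of Promenade.

```python
def etcd_quorum(cluster_size: int) -> int:
    """Return the smallest number of members that forms a majority."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def has_quorum(healthy_members: int, cluster_size: int) -> bool:
    """True when enough members are up for the cluster to make progress."""
    return healthy_members >= etcd_quorum(cluster_size)


for size in (1, 3, 5):
    print(f"{size} members -> quorum of {etcd_quorum(size)}")
```

For a three-member cluster the quorum is two, so after a full restart the cluster can resume once any two master nodes have brought their Etcd members back on their own.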
The solution approach is two-pronged:
- Deploy a local discovery mechanism for the apiserver processes on each node so that core components can always find the apiservers when their nodes reboot.
- Apply the Anchor pattern described below to ensure that essential components on master nodes restart even when the apiservers are not available.
Currently, the discovery mechanism for the apiserver processes is provided by CoreDNS via a zone file written to disk on each node. This approach has some drawbacks, which might be addressed in future work by leveraging HAProxy for discovery instead.
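To illustrate the idea of file-based discovery, the sketch below renders zone-file-style text with one A record per apiserver. The zone name, record name, addresses and SOA values are invented for the example, and the layout is not claimed to match the file Promenade actually writes.

```python
def render_apiserver_zone(zone, record_name, apiserver_ips, ttl=5):
    """Render zone-file-style text with one A record per apiserver.

    All names and values here are hypothetical placeholders.
    """
    if not apiserver_ips:
        raise ValueError("at least one apiserver address is required")
    lines = [
        f"$ORIGIN {zone}.",
        f"$TTL {ttl}",
        # SOA fields: serial, refresh, retry, expire, minimum
        f"@ IN SOA ns.{zone}. hostmaster.{zone}. 1 3600 600 86400 {ttl}",
    ]
    # One A record per apiserver, so a lookup returns every known address.
    lines += [f"{record_name} IN A {ip}" for ip in apiserver_ips]
    return "\n".join(lines) + "\n"


zone_text = render_apiserver_zone(
    "promenade.example", "apiserver", ["192.168.77.10", "192.168.77.11"]
)
print(zone_text)
```

Because the file lives on the node's own disk, a resolver serving it can answer apiserver lookups after a reboot without contacting any other machine.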
Anchor Pattern
The anchor pattern provides a way to manage process life-cycle using Helm charts in a way that allows them to be restarted immediately in the event of a node restart -- even when the Kubernetes apiserver is unreachable.
In this pattern, a DaemonSet called the anchor runs on selected nodes and is responsible for managing the life-cycle of assets deployed onto the node file system. In particular, these assets include a Kubernetes Pod manifest to be consumed by kubelet, which manages the processes specified by the Pod. That management continues even when the node reboots, since static pods like this are run by the kubelet even when the apiserver is not available.
Cleanup of these resources is managed by the anchor pods' preStop life-cycle hooks. This is usually simply removing the files originally placed on the nodes' file systems, but, e.g. in the case of Etcd, can actually be used to manage more complex cleanup like removal from cluster membership.
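The two halves of the anchor's job, placing assets and cleaning them up, can be sketched as a pair of functions. This is a minimal illustration under stated assumptions rather than Promenade's implementation: the function names are hypothetical, and temporary directories stand in for the node's real static pod manifest directory.

```python
import shutil
import tempfile
from pathlib import Path


def install_assets(source_manifest: Path, manifest_dir: Path) -> Path:
    """Copy a Pod manifest into the directory watched for static pods."""
    manifest_dir.mkdir(parents=True, exist_ok=True)
    target = manifest_dir / source_manifest.name
    # Write under a temporary name and rename, so a reader of the
    # directory never sees a partially written manifest.
    staging = target.with_name("." + target.name + ".tmp")
    shutil.copyfile(source_manifest, staging)
    staging.replace(target)
    return target


def pre_stop_cleanup(target: Path) -> None:
    """Remove the manifest, as a preStop hook might do on anchor removal."""
    target.unlink(missing_ok=True)


# Demonstration with temporary directories in place of real node paths.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    source = root / "etcd.yaml"
    source.write_text("kind: Pod\n")
    installed = install_assets(source, root / "manifests")
    installed_ok = installed.exists()
    pre_stop_cleanup(installed)
    removed_ok = not installed.exists()

print(installed_ok, removed_ok)
```

The key property is that the manifest stays on disk between the two calls, so the process it describes can be restarted after a reboot without any answer from the apiserver.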
Pod Checkpointer
Before moving to the Anchor pattern above, the pod-checkpointer approach pioneered by the Bootkube project was implemented. While this is an appealing approach, it unfortunately suffers from race conditions during full cluster reboot.
During cluster reboot, the checkpointer copies essential static manifests into place for the kubelet to run, which allows those components to start and become available. Once the apiserver and etcd cluster are functional, kubelet is able to register the failure of its workloads, and delete those pods via the API. This is where the race begins.
Once those pods are deleted from the apiserver, the pod checkpointer notices that the flagged pods are no longer scheduled to run on its node and then deletes the static manifests for those pods. Concurrently, the controller-manager and scheduler notice that new pods need to be created and scheduled (sequentially) and begin that work.
If the new pods are created, scheduled and started on the node before pod checkpointers on other nodes delete their critical services, then the cluster may remain healthy after the reboot. If enough nodes running the critical services fail to start the newly created pods before too many are removed, then the cluster does not recover from hard reboot.
The severity of this race is exacerbated by:
- The sequence of events required to successfully replace these pods is long (controller-manager must create pods, then scheduler can schedule pods, then kubelet can start pods).
- The controller-manager and scheduler may need to perform leader election during the race, because the leader might have been killed early.
- The failure to recover any one of the core sets of processes can cause the entire cluster to fail. This is somewhat trajectory-dependent, e.g. if at least one controller-manager is scheduled before the controller-manager processes are all killed, then, assuming the other processes are correctly restarted, the controller-manager will also recover. etcd is somewhat more sensitive to this race, because it requires two successfully restarted pods (assuming a 3 node cluster) rather than just one, as for the other components.
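The recovery condition described in the last point can be written down as a simplified model. The component list and the thresholds below are assumptions drawn from the description above (a majority for etcd, a single instance for the others); the model ignores timing and leader election entirely.

```python
# Simplified model of which core components came back after a full reboot.
# The component list is an illustrative assumption.
CORE_COMPONENTS = ("apiserver", "controller-manager", "scheduler", "etcd")


def required_pods(component: str, cluster_size: int) -> int:
    """Minimum restarted pods a component needs for the cluster to recover."""
    if component == "etcd":
        # etcd needs a majority of its members, e.g. 2 of 3.
        return cluster_size // 2 + 1
    # The other components only need a single running instance.
    return 1


def cluster_recovers(restarted: dict, cluster_size: int) -> bool:
    """True only if every core component meets its minimum."""
    return all(
        restarted.get(name, 0) >= required_pods(name, cluster_size)
        for name in CORE_COMPONENTS
    )


healthy = {"apiserver": 1, "controller-manager": 1, "scheduler": 1, "etcd": 2}
etcd_short = {"apiserver": 3, "controller-manager": 3, "scheduler": 3, "etcd": 1}
print(cluster_recovers(healthy, 3))
print(cluster_recovers(etcd_short, 3))
```

In the second case every other component is fully restarted, yet the cluster still fails because only one of three etcd pods came back, which is the asymmetry the race exposes.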
This race condition was the motivation for the construction and use of the Anchor pattern. In future versions of Kubernetes, it may be possible to use built-in checkpointing from the kubelet.
Alternatives
-
- Does not yet support HA
- Current approach to HA Etcd is to use the etcd operator, which recovers from cluster reboot by loading from an external backup snapshot
- Does not support chart-based management of components
-
- Does not support bare metal
-
- Does not support automatic recovery from a full cluster reboot
- Does not yet support full HA
- Adheres to different design goals (minimal direct server contact), which makes some of these changes challenging, e.g. building a self-contained, multi-master cluster
- Does not support chart-based management of components